
dbstore1002 Mysql errors
Closed, Resolved (Public)

Description

MySQL logged the following errors:

190114  0:20:53 [ERROR] Master 's8': Slave SQL: Error 'Got error 175 "File too short; Expected more data in file" from storage engine Aria' on query. Default database: 'wikidatawiki'. Query: 'INSERT /* ORES\Storage\SqlScoreStorage::storeScores  */ IGNORE INTO `ores_classification` (oresc_rev,oresc_model,oresc_class,oresc_probability,oresc_is_predicted) VALUES ('833966611','11','1','0.12988314844986','0'),('833966611','12','1','0.99990210692433','1')', Gtid 171978778-171978778-1728518754, Internal MariaDB error code: 1030
190114  0:20:53 [Warning] Master 's8': Slave: Data truncated for column 'oresc_probability' at row 1 Error_code: 1265
190114  0:20:53 [Warning] Master 's8': Slave: Got error 175 "File too short; Expected more data in file" from storage engine Aria Error_code: 1030
190114  0:20:53 [ERROR] Master 's8': Error running query, slave SQL thread aborted. Fix the problem, and restart the slave SQL thread with "SLAVE START". We stopped at log 'db1071-bin.008041' position 191299182
190114  0:20:53 [ERROR] Master 's7': Slave SQL: Error 'Incorrect key file for table 'linter'; try to repair it' on query. Default database: 'eswiki'. Query: 'DELETE /* MediaWiki\Linter\Database::setForPage  */ FROM `linter` WHERE linter_id = '35618088'', Gtid 171978767-171978767-3162612550, Internal MariaDB error code: 1034
190114  0:20:53 [Warning] Master 's7': Slave: Incorrect key file for table 'linter'; try to repair it Error_code: 1034
190114  0:20:53 [ERROR] Master 's7': Error running query, slave SQL thread aborted. Fix the problem, and restart the slave SQL thread with "SLAVE START". We stopped at log 'db1062-bin.001887' position 179851086
190114  0:20:54 [ERROR] Master 's6': Slave SQL: Error 'Incorrect key file for table 'linter'; try to repair it' on query. Default database: 'frwiki'. Query: 'DELETE /* MediaWiki\Linter\Database::setForPage  */ FROM `linter` WHERE linter_id = '94823788'', Gtid 171974883-171974883-782216379, Internal MariaDB error code: 1034
190114  0:20:54 [Warning] Master 's6': Slave: Incorrect key file for table 'linter'; try to repair it Error_code: 1034
190114  0:20:54 [ERROR] Master 's6': Error running query, slave SQL thread aborted. Fix the problem, and restart the slave SQL thread with "SLAVE START". We stopped at log 'db1061-bin.003038' position 885675418
190114  0:20:54 [ERROR] Master 's4': Slave SQL: Error 'Incorrect key file for table 'linter'; try to repair it' on query. Default database: 'commonswiki'. Query: 'DELETE /* MediaWiki\Linter\Database::setForPage  */ FROM `linter` WHERE linter_id = '264683363'', Gtid 171978775-171978775-3972586867, Internal MariaDB error code: 1034
190114  0:20:54 [Warning] Master 's4': Slave: Incorrect key file for table 'linter'; try to repair it Error_code: 1034
190114  0:20:54 [ERROR] Master 's4': Error running query, slave SQL thread aborted. Fix the problem, and restart the slave SQL thread with "SLAVE START". We stopped at log 'db1068-bin.002379' position 676607087
190114  0:20:54 [ERROR] Master 's1': Slave SQL: Error 'Incorrect key file for table 'ores_classification'; try to repair it' on query. Default database: 'enwiki'. Query: 'DELETE /* ORES\Storage\SqlScoreStorage::purgeRows  */ FROM `ores_classification` WHERE oresc_rev IN ('873775278','847452659','873775278','873775278','873775281','873775280','873775282','857441417','873775280','873775280','873775283','873775286','873775284','873775285','873775287','873775287','873775285','873775285','873775288','873775289','873775290','873775288','873775288','873775291','873693082','867478084','873775291','873775291','873775293','873775292','873775294','777550844','858544826','861754557','873775292','873775292','873775294','873775294','873775295','873775296','873775296','873775296','873775298','873775299','873775300','873775301','873775297','873775302','873775298','873775298','873775302')  AND (oresc_model NOT IN (36, 56))', Gtid 171974720-171974720-893955765, Internal MariaDB error code: 1034
190114  0:20:54 [Warning] Master 's1': Slave: Incorrect key file for table 'ores_classification'; try to repair it Error_code: 1034
190114  0:20:54 [ERROR] Master 's1': Error running query, slave SQL thread aborted. Fix the problem, and restart the slave SQL thread with "SLAVE START". We stopped at log 'db1067-bin.002001' position 1044584825
190114  0:21:01 [ERROR] Master 's2': Slave SQL: Error 'Incorrect key file for table 'linter'; try to repair it' on query. Default database: 'zhwiki'. Query: 'DELETE /* MediaWiki\Linter\Database::setForPage  */ FROM `linter` WHERE linter_id IN ('24489510','24489511','24489512')', Gtid 171966574-171966574-1051922471, Internal MariaDB error code: 1034
190114  0:21:01 [Warning] Master 's2': Slave: Incorrect key file for table 'linter'; try to repair it Error_code: 1034
190114  0:21:01 [ERROR] Master 's2': Error running query, slave SQL thread aborted. Fix the problem, and restart the slave SQL thread with "SLAVE START". We stopped at log 'db1066-bin.000676' position 942097716
190114  0:21:13 [ERROR] Master 's3': Slave SQL: Error 'Incorrect key file for table 'linter'; try to repair it' on query. Default database: 'mediawikiwiki'. Query: 'DELETE /* MediaWiki\Linter\Database::setForPage  */ FROM `linter` WHERE linter_id IN ('533503','533504','533505')', Gtid 171966669-171966669-3531452725, Internal MariaDB error code: 1034
190114  0:21:13 [Warning] Master 's3': Slave: Incorrect key file for table 'linter'; try to repair it Error_code: 1034
190114  0:21:13 [ERROR] Master 's3': Error running query, slave SQL thread aborted. Fix the problem, and restart the slave SQL thread with "SLAVE START". We stopped at log 'db1075-bin.003613' position 526682548
190114  0:27:11 [ERROR] Master 's5': Slave SQL: Error 'Incorrect key file for table 'linter'; try to repair it' on query. Default database: 'dewiki'. Query: 'DELETE /* MediaWiki\Linter\Database::setForPage  */ FROM `linter` WHERE linter_id = '3964750'', Gtid 171978777-171978777-1429163023, Internal MariaDB error code: 1034
190114  0:27:11 [Warning] Master 's5': Slave: Incorrect key file for table 'linter'; try to repair it Error_code: 1034
190114  0:27:11 [ERROR] Master 's5': Error running query, slave SQL thread aborted. Fix the problem, and restart the slave SQL thread with "SLAVE START". We stopped at log 'db1070-bin.002393' position 685293454
190114  2:02:24 [ERROR] mysqld: Aria engine: checkpoint failed
190114  3:09:56 [ERROR] mysqld: Aria engine: checkpoint failed
190114  3:10:27 [ERROR] mysqld: Aria engine: checkpoint failed
190114  3:10:59 [ERROR] mysqld: Aria engine: checkpoint failed

Event Timeline


I have repaired and analyzed all of the affected tables that were reported, but replication is still complaining about them.
I can also run SELECTs on the tables without any problem.

zhwiki
+---------------+---------+----------+----------+
| Table         | Op      | Msg_type | Msg_text |
+---------------+---------+----------+----------+
| zhwiki.linter | analyze | status   | OK       |
+---------------+---------+----------+----------+
mediawikiwiki
+----------------------+---------+----------+----------+
| Table                | Op      | Msg_type | Msg_text |
+----------------------+---------+----------+----------+
| mediawikiwiki.linter | analyze | status   | OK       |
+----------------------+---------+----------+----------+
commonswiki
+--------------------+---------+----------+----------+
| Table              | Op      | Msg_type | Msg_text |
+--------------------+---------+----------+----------+
| commonswiki.linter | analyze | status   | OK       |
+--------------------+---------+----------+----------+
dewiki
+---------------+---------+----------+----------+
| Table         | Op      | Msg_type | Msg_text |
+---------------+---------+----------+----------+
| dewiki.linter | analyze | status   | OK       |
+---------------+---------+----------+----------+
frwiki
+---------------+---------+----------+----------+
| Table         | Op      | Msg_type | Msg_text |
+---------------+---------+----------+----------+
| frwiki.linter | analyze | status   | OK       |
+---------------+---------+----------+----------+
eswiki
+---------------+---------+----------+----------+
| Table         | Op      | Msg_type | Msg_text |
+---------------+---------+----------+----------+
| eswiki.linter | analyze | status   | OK       |
+---------------+---------+----------+----------+
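
For reference, the output above comes from per-wiki ANALYZE TABLE runs; the repair-and-analyze pass would have looked roughly like this (a sketch of the session, not a verbatim transcript):

-- repeated for each affected wiki database, e.g. eswiki
USE eswiki;
REPAIR TABLE linter;   -- repair the reported table
ANALYZE TABLE linter;  -- produces the "analyze | status | OK" rows shown above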

I have tried to convert the table to InnoDB, and this also fails:

root@dbstore1002.eqiad.wmnet[wikidatawiki]> alter table linter engine=InnoDB;
ERROR 1025 (HY000): Error on rename of './wikidatawiki/linter' to './wikidatawiki/#sql2-9441-3d2123e' (errno: 1 "Operation not permitted")

Executed bmc-device --debug --cold-reset on localhost, since the mgmt interface was not available ("No more sessions available").

racadm getsel output for today (remember that one disk had already failed; we have a task about it):

-------------------------------------------------------------------------------
Record:      7
Date/Time:   01/14/2019 07:48:10
Source:      system
Severity:    Critical
Description: Fault detected on drive 2 in disk drive bay 1.
-------------------------------------------------------------------------------
Record:      8
Date/Time:   01/14/2019 07:48:35
Source:      system
Severity:    Non-Critical
Description: Fan 5 RPM is less than the lower warning threshold.
-------------------------------------------------------------------------------
Record:      9
Date/Time:   01/14/2019 07:48:35
Source:      system
Severity:    Critical
Description: Fan 5 RPM is less than the lower critical threshold.
-------------------------------------------------------------------------------
Record:      10
Date/Time:   01/14/2019 07:48:39
Source:      system
Severity:    Critical
Description: Fan redundancy is lost.
-------------------------------------------------------------------------------

I wanted to drop and rebuild the tables, but I cannot even drop them:

root@dbstore1002.eqiad.wmnet[eswiki]> drop table linter;
ERROR 1030 (HY000): Got error 1 "Operation not permitted" from storage engine Aria
root@dbstore1002.eqiad.wmnet[eswiki]> rename table linter to TO_DROP_linter;
ERROR 1025 (HY000): Error on rename of './eswiki/linter' to './eswiki/TO_DROP_linter' (errno: 1 "Operation not permitted")

The filesystem doesn't seem to be read-only:

root@dbstore1002:/srv/sqldata/eswiki# touch test
root@dbstore1002:/srv/sqldata/eswiki# rm test
root@dbstore1002:/srv/sqldata/eswiki#
elukey renamed this task from "dbstore1002 crashed" to "dbstore1002 Mysql errors". Jan 14 2019, 8:12 AM
elukey updated the task description.

A mysql stop didn't work, and neither did a regular kill of mysqld; I had to kill -9 mysqld, after which mysqld_safe restarted the process. It is starting up now... I will report back.

190114 09:23:00 mysqld_safe Number of processes running now: 0
190114 09:23:00 mysqld_safe mysqld restarted
190114  9:23:00 [Note] /opt/wmf-mariadb10/bin/mysqld (mysqld 10.0.22-MariaDB) starting as process 19655 ...
2019-01-14 09:23:00 7fbedee377c0 InnoDB: Warning: Using innodb_locks_unsafe_for_binlog is DEPRECATED. This option may be removed in future releases. Please use READ COMMITTED transaction isolation level instead, see http://dev.mysql.com/doc/refman/5.6/en/set-transaction.html.
190114  9:23:00 [Note] InnoDB: Using mutexes to ref count buffer pool pages
190114  9:23:00 [Note] InnoDB: The InnoDB memory heap is disabled
190114  9:23:00 [Note] InnoDB: Mutexes and rw_locks use GCC atomic builtins
190114  9:23:00 [Note] InnoDB: Memory barrier is not used
190114  9:23:00 [Note] InnoDB: Compressed tables use zlib 1.2.8
190114  9:23:00 [Note] InnoDB: Using CPU crc32 instructions
190114  9:23:00 [Note] InnoDB: Initializing buffer pool, size = 18.0G
190114  9:23:01 [Note] InnoDB: Completed initialization of buffer pool
190114  9:23:01 [Note] InnoDB: Highest supported file format is Barracuda.
190114  9:23:01 [Note] InnoDB: Log scan progressed past the checkpoint lsn 97955068646220
190114  9:23:01 [Note] InnoDB: Database was not shutdown normally!
190114  9:23:01 [Note] InnoDB: Starting crash recovery.
190114  9:23:01 [Note] InnoDB: Reading tablespace information from the .ibd files...

MySQL is failing to start:

190114  9:32:13 [ERROR] mysqld: Aria recovery failed. Please run aria_chk -r on all Aria tables and delete all aria_log.######## files
190114  9:32:13 [ERROR] Plugin 'Aria' init function returned error.
190114  9:32:13 [ERROR] Plugin 'Aria' registration as a STORAGE ENGINE failed.
190114  9:32:13 [Note] Plugin 'FEEDBACK' is disabled.
Mon Jan 14 09:32:14 2019 TokuFT recovery starting in env /srv/sqldata/
Mon Jan 14 09:32:14 2019 TokuFT recovery scanning backward from 720877479758
Mon Jan 14 09:32:14 2019 TokuFT recovery bw_end_checkpoint at 720877479758 timestamp 1547457726104672 xid 720877479754 (bw_newer)
Mon Jan 14 09:32:14 2019 TokuFT recovery bw_begin_checkpoint at 720877479754 timestamp 1547457726104474 (bw_between)
Mon Jan 14 09:32:14 2019 TokuFT recovery turning around at begin checkpoint 720877479754 time 198
Mon Jan 14 09:32:14 2019 TokuFT recovery starts scanning forward to 720877479758 from 720877479754 left 4 (fw_between)
Mon Jan 14 09:32:14 2019 TokuFT recovery closing 2 dictionaries
Mon Jan 14 09:32:14 2019 TokuFT recovery making a checkpoint
Mon Jan 14 09:32:14 2019 TokuFT recovery done
190114  9:32:14 [Note] CONNECT: Version 1.03.0007 October 20, 2015
190114  9:32:14 [ERROR] Aria engine is not enabled or did not start. The Aria engine must be enabled to continue as mysqld was configured with --with-aria-tmp-tables
190114  9:32:14 [ERROR] Aborting

Mentioned in SAL (#wikimedia-operations) [2019-01-14T09:37:25Z] <marostegui> Running aria_chk for all linter tables on dbstore1002 - T213670

Mentioned in SAL (#wikimedia-operations) [2019-01-14T09:51:47Z] <marostegui> Running aria_chk for all myisam tables on dbstore1002 T213670

Marostegui triaged this task as High priority.

Update for all the users of dbstore1002:

We are still not sure what happened to dbstore1002, but now MySQL does not start due to some issues with Aria-based tables. A lot of them (from 2014-2017) are in the staging db, so it is not only the replicated wikis that are affected. The Data Persistence team is trying to run a repair tool (aria_chk) on all the Aria tables as a "safe" attempt to fix the startup errors. MySQL is currently down and not usable. There is no clear ETA, since the aria_chk tool might take anywhere from minutes to hours to complete.
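
For reference, a minimal sketch of what such an aria_chk pass looks like (paths assumed from the datadir seen earlier; it has to run with mysqld stopped, and the exact invocation may differ):

cd /srv/sqldata
find . -name '*.MAI' -exec aria_chk -r {} \;   # -r recovers each Aria table via its index file
rm aria_log.*                                  # delete the aria_log.######## files, as the startup error suggests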

Fixing all the Aria tables has finished.
MySQL has started correctly, and so has replication.
The host is now lagging behind, as it has been down for a while, but it will eventually catch up:

root@cumin1001:~# mysql.py -hdbstore1002 -e "show all slaves status\G" | egrep -i "Connection|Seconds"
              Connection_name: s1
        Seconds_Behind_Master: 50537
              Connection_name: s2
        Seconds_Behind_Master: 50329
              Connection_name: s3
        Seconds_Behind_Master: 50319
              Connection_name: s4
        Seconds_Behind_Master: 50342
              Connection_name: s5
        Seconds_Behind_Master: 49820
              Connection_name: s6
        Seconds_Behind_Master: 50299
              Connection_name: s7
        Seconds_Behind_Master: 50323
              Connection_name: s8
        Seconds_Behind_Master: 50304
              Connection_name: x1
        Seconds_Behind_Master: 19212

I am going to create a task to convert all those tables to InnoDB to avoid similar issues with Aria in the future.
I am going to leave this open for a few more hours to make sure it doesn't crash again.

Update for all the users of dbstore1002:

MySQL is up again after some hours of downtime; it is usable but still catching up with replication.

Let's figure out whether we need to back up the staging database.

I am closing this, as replication keeps catching up nicely without any errors:

root@cumin1001:/home/marostegui# mysql.py -hdbstore1002 -e "show all slaves status\G" | egrep -i "Connection|Seconds"
              Connection_name: s1
        Seconds_Behind_Master: 53249
              Connection_name: s2
        Seconds_Behind_Master: 15950
              Connection_name: s3
        Seconds_Behind_Master: 24001
              Connection_name: s4
        Seconds_Behind_Master: 36069
              Connection_name: s5
        Seconds_Behind_Master: 0
              Connection_name: s6
        Seconds_Behind_Master: 6443
              Connection_name: s7
        Seconds_Behind_Master: 24606
              Connection_name: s8
        Seconds_Behind_Master: 36743
              Connection_name: x1
        Seconds_Behind_Master: 0

I actually had a query on dbstore1002 that had been running on Wikidata for 25 hours as of 23:30 UTC yesterday (T210807#4876169). Am I right to say that didn't cause this issue? 😟

In T213670#4878956, @Neil_P._Quinn_WMF wrote:

I actually had a query on dbstore1002 that had been running on Wikidata for 25 hours as of 23:30 UTC yesterday (T210807#4876169). Am I right to say that didn't cause this issue? 😟

@elukey @Marostegui I would like to know your opinion on this, so I know whether I should be more aggressive about killing my own queries next time!

dbstore1002 crashed again last night - this time, the mysqld process crashed.

@Neil_P._Quinn_WMF what does your query look like?

dbstore1002 is in a very bad state; we need to hurry up with its replacement.

Mentioned in SAL (#wikimedia-operations) [2019-01-16T09:16:16Z] <marostegui> Stop replication in sync on dbstore1002:x1 and db2034 - T213670

I have re-imported wikishared.echo_unread_wikis, as it was failing due to inconsistencies.
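
Re-importing a single table like that is typically done by dumping it from a healthy replica of the same section and loading it back; a minimal sketch (the source host is a placeholder, and the actual procedure may have differed):

# with the x1 replication thread stopped on dbstore1002
mysqldump --single-transaction -h <healthy-x1-replica> wikishared echo_unread_wikis > echo_unread_wikis.sql
mysql -h dbstore1002.eqiad.wmnet wikishared < echo_unread_wikis.sql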

More errors appeared:

Last_SQL_Error: Error 'Can't lock file (errno: 22 "Invalid argument")' on query. Default database: 'outreachwiki'. Query: 'INSERT /* ManualLogEntry::insert  */ IGNORE INTO `log_search` (ls_field,ls_value,ls_log_id) VALUES ('associated_rev_id','xx','xx')'

And:

Last_SQL_Error: Error 'Got error 22 "Invalid argument" from storage engine TokuDB' on query. Default database: 'outreachwiki'. Query: 'SAVEPOINT `wikimedia_rdbms_atomic254`'

This happened for all the tables on that particular database.
I converted all of them to InnoDB and restarted replication.
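
One way to convert every table in a database (a sketch, not necessarily the exact method used here) is to generate the ALTER statements from information_schema and then run them:

SELECT CONCAT('ALTER TABLE `', table_schema, '`.`', table_name, '` ENGINE=InnoDB;')
FROM information_schema.tables
WHERE table_schema = 'outreachwiki'
  AND engine <> 'InnoDB';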

dbstore1002 crashed again last night - this time, the mysqld process crashed.

@Neil_P._Quinn_WMF what does your query look like?

dbstore1002 is in a very bad state; we need to hurry up with its replacement.

I posted my query in T210807#4876169. You don't need to worry about me getting the data; I rewrote it yesterday into something much more performant and got all the data without trouble. I just wanted to mention it here in case it's important to the debugging. However, the fact that problems started happening again without my monster query strongly suggests it was unrelated 😁

Thank you for all your work fighting this fire! If these problems continue and there needs to be an emergency acceleration of the dbstore1002 migration timeline, I'm sure my team can adapt.

Another crash:

InnoDB: Warning: a long semaphore wait:
--Thread 139957469411072 has waited at dict0stats.cc line 2391 for 241.00 seconds the semaphore:
X-lock (wait_ex) on RW-latch at 0x1420640 '&dict_operation_lock'
a writer (thread id 139957469411072) has reserved it in mode  wait exclusive
number of readers 1, waiters flag 1, lock_word: ffffffffffffffff
Last time read locked in file row0purge.cc line 768
Last time write locked in file /home/jynus/mariadb-10.0.22/storage/xtradb/dict/dict0stats.cc line 2391
InnoDB: Warning: a long semaphore wait:
--Thread 139966395524864 has waited at row0undo.cc line 298 for 242.00 seconds the semaphore:
S-lock on RW-latch at 0x1420640 '&dict_operation_lock'
a writer (thread id 139957469411072) has reserved it in mode  wait exclusive
number of readers 1, waiters flag 1, lock_word: ffffffffffffffff
Last time read locked in file row0purge.cc line 768
Last time write locked in file /home/jynus/mariadb-10.0.22/storage/xtradb/dict/dict0stats.cc line 2391
InnoDB: Warning: semaphore wait:
--Thread 139957469411072 has waited at dict0stats.cc line 2391 for 242.00 seconds the semaphore:
X-lock (wait_ex) on RW-latch at 0x1420640 '&dict_operation_lock'
a writer (thread id 139957469411072) has reserved it in mode  wait exclusive
number of readers 1, waiters flag 1, lock_word: ffffffffffffffff
Last time read locked in file row0purge.cc line 768
Last time write locked in file /home/jynus/mariadb-10.0.22/storage/xtradb/dict/dict0stats.cc line 2391
InnoDB: Warning: Writer thread is waiting this semaphore:
--Thread 139957469411072 has waited at dict0stats.cc line 2391 for 242.00 seconds the semaphore:
X-lock (wait_ex) on RW-latch at 0x1420640 '&dict_operation_lock'
a writer (thread id 139957469411072) has reserved it in mode  wait exclusive
number of readers 1, waiters flag 1, lock_word: ffffffffffffffff
Last time read locked in file row0purge.cc line 768
Last time write locked in file /home/jynus/mariadb-10.0.22/storage/xtradb/dict/dict0stats.cc line 2391
InnoDB: Warning: semaphore wait:
--Thread 139966395524864 has waited at row0undo.cc line 298 for 242.00 seconds the semaphore:
S-lock on RW-latch at 0x1420640 '&dict_operation_lock'
a writer (thread id 139957469411072) has reserved it in mode  wait exclusive
number of readers 1, waiters flag 1, lock_word: ffffffffffffffff
Last time read locked in file row0purge.cc line 768
Last time write locked in file /home/jynus/mariadb-10.0.22/storage/xtradb/dict/dict0stats.cc line 2391
InnoDB: Warning: Writer thread is waiting this semaphore:
--Thread 139957469411072 has waited at dict0stats.cc line 2391 for 242.00 seconds the semaphore:
X-lock (wait_ex) on RW-latch at 0x1420640 '&dict_operation_lock'
a writer (thread id 139957469411072) has reserved it in mode  wait exclusive
number of readers 1, waiters flag 1, lock_word: ffffffffffffffff
Last time read locked in file row0purge.cc line 768
Last time write locked in file /home/jynus/mariadb-10.0.22/storage/xtradb/dict/dict0stats.cc line 2391
InnoDB: Warning: semaphore wait:
--Thread 139976546338560 has waited at row0undo.cc line 298 for 232.00 seconds the semaphore:
S-lock on RW-latch at 0x1420640 '&dict_operation_lock'
a writer (thread id 139957469411072) has reserved it in mode  wait exclusive
number of readers 1, waiters flag 1, lock_word: ffffffffffffffff

<snip>
We will try our best to scrape up some info that will hopefully help
diagnose the problem, but since we have already crashed,
something is definitely wrong and this may fail.

Server version: 10.0.22-MariaDB
key_buffer_size=134217728
read_buffer_size=131072
max_used_connections=42
max_threads=252
thread_count=24
It is possible that mysqld could use up to
key_buffer_size + (read_buffer_size + sort_buffer_size)*max_threads = 684383 K  bytes of memory
Hope that's ok; if not, decrease some variables in the equation.

Thread pointer: 0x0x0
Attempting backtrace. You can use the following information to find out
where mysqld died. If you see no messages after this, something went
terribly wrong...
stack_bottom = 0x0 thread_stack 0x48000
mysys/stacktrace.c:247(my_print_stacktrace)[0xbdd6ee]
sql/signal_handler.cc:153(handle_fatal_signal)[0x73dc40]
/lib/x86_64-linux-gnu/libpthread.so.0(+0x10330)[0x7f4f73580330]
/lib/x86_64-linux-gnu/libc.so.6(gsignal+0x37)[0x7f4f72394c37]
/lib/x86_64-linux-gnu/libc.so.6(abort+0x148)[0x7f4f72398028]
srv/srv0srv.cc:2200(srv_error_monitor_thread)[0x9870aa]
/lib/x86_64-linux-gnu/libpthread.so.0(+0x8184)[0x7f4f73578184]
/lib/x86_64-linux-gnu/libc.so.6(clone+0x6d)[0x7f4f7245c03d]
The manual page at http://dev.mysql.com/doc/mysql/en/crashing.html contains
information that should help you find out what is causing the crash.
190117 16:25:12 mysqld_safe Number of processes running now: 0
190117 16:25:12 mysqld_safe mysqld restarted

Replication is flowing. I am glad we migrated most of those MyISAM tables away; so far it is working fine.

Mentioned in SAL (#wikimedia-operations) [2019-01-19T08:42:12Z] <marostegui> Fixing dbstore1002 x1 replication T213670

Table frwiki.echo_notification (x1) might need reimporting:

Last_SQL_Error: Could not execute Update_rows_v1 event on table frwiki.echo_notification; Can't find record in 'echo_notification', Error_code: 1032; handler error HA_ERR_KEY_NOT_FOUND; the event's master log db1069-bin.000314, end_log_pos 800908454

I have fixed all the missing rows and replication on x1 is catching up now.
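
Fixing missing rows like these usually means looking at the row images in the master's binlog around the failing event; a sketch of inspecting that region with mysqlbinlog (file name and position are taken from the error above; the exact options used may differ):

# run on the x1 master, where db1069-bin.000314 lives (or use --read-from-remote-server)
mysqlbinlog --base64-output=decode-rows --verbose \
    --stop-position=800908454 db1069-bin.000314 | tail -n 200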

And now the MySQL process has crashed.

After all the crashes, MySQL was able to start at around 10:18:11 (UTC). @elukey started replication on all slaves at around 12:07:56 (UTC). x1 replication was broken, and I fixed lots of rows from 12:38:10 (UTC) until 13:25:11 (UTC); at that point x1 caught up with the master, and it is now replicating fine.

Another crash happened last night:

Thread pointer: 0x0x0
Attempting backtrace. You can use the following information to find out
where mysqld died. If you see no messages after this, something went
terribly wrong...
stack_bottom = 0x0 thread_stack 0x48000
mysys/stacktrace.c:247(my_print_stacktrace)[0xbdd6ee]
sql/signal_handler.cc:153(handle_fatal_signal)[0x73dc40]
/lib/x86_64-linux-gnu/libpthread.so.0(+0x10330)[0x7fe0261f7330]
/lib/x86_64-linux-gnu/libc.so.6(gsignal+0x37)[0x7fe02500bc37]
/lib/x86_64-linux-gnu/libc.so.6(abort+0x148)[0x7fe02500f028]
srv/srv0srv.cc:2200(srv_error_monitor_thread)[0x9870aa]
/lib/x86_64-linux-gnu/libpthread.so.0(+0x8184)[0x7fe0261ef184]
/lib/x86_64-linux-gnu/libc.so.6(clone+0x6d)[0x7fe0250d303d]
The manual page at http://dev.mysql.com/doc/mysql/en/crashing.html contains
information that should help you find out what is causing the crash.
190122 00:19:05 mysqld_safe Number of processes running now: 0
190122 00:19:05 mysqld_safe mysqld restarted
190122  0:19:06 [Note] /opt/wmf-mariadb10/bin/mysqld (mysqld 10.0.22-MariaDB) starting as process 18005 ...
2019-01-22 00:19:06 7efde9a2e7c0 InnoDB: Warning: Using innodb_locks_unsafe_for_binlog is DEPRECATED. This option may be removed in future releases. Please use READ COMMITTED transaction isolation level instead, see http://dev.mysql.com/doc/refman/5.6/en/set-transaction.html.
190122  0:19:06 [Note] InnoDB: Using mutexes to ref count buffer pool pages
190122  0:19:06 [Note] InnoDB: The InnoDB memory heap is disabled
190122  0:19:06 [Note] InnoDB: Mutexes and rw_locks use GCC atomic builtins
190122  0:19:06 [Note] InnoDB: Memory barrier is not used
190122  0:19:06 [Note] InnoDB: Compressed tables use zlib 1.2.8
190122  0:19:06 [Note] InnoDB: Using CPU crc32 instructions
190122  0:19:06 [Note] InnoDB: Initializing buffer pool, size = 18.0G
190122  0:19:07 [Note] InnoDB: Completed initialization of buffer pool
190122  0:19:07 [Note] InnoDB: Highest supported file format is Barracuda.
190122  0:19:07 [Note] InnoDB: Log scan progressed past the checkpoint lsn 99856990038248
190122  0:19:07 [Note] InnoDB: Database was not shutdown normally!
190122  0:19:07 [Note] InnoDB: Starting crash recovery.
190122  0:19:07 [Note] InnoDB: Reading tablespace information from the .ibd files...
190122  0:27:20 [Note] InnoDB: Restoring possible half-written data pages
190122  0:27:20 [Note] InnoDB: from the doublewrite buffer...
InnoDB: Doing recovery: scanned up to log sequence number 99856995280896
InnoDB: Doing recovery: scanned up to log sequence number 99857000523776
InnoDB: Doing recovery: scanned up to log sequence number 99857005766656
InnoDB: Doing recovery: scanned up to log sequence number 99857011009536
InnoDB: Doing recovery: scanned up to log sequence number 99857016252416

All the replication threads except x1 started fine.
I have fixed all the x1 rows that failed, and it has now caught up.

Mentioned in SAL (#wikimedia-operations) [2019-01-23T14:55:09Z] <marostegui> Compress InnoDB on a few tables on dbstore1002 to gain some extra space - T213670
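
The compression mentioned above is presumably applied per table with something along these lines (the table name here is purely hypothetical; this needs innodb_file_per_table and the Barracuda file format, which the startup log earlier reports as supported):

ALTER TABLE enwiki.revision ENGINE=InnoDB ROW_FORMAT=COMPRESSED KEY_BLOCK_SIZE=8;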

Mentioned in SAL (#wikimedia-operations) [2019-01-23T15:20:14Z] <marostegui> Truncate wmf_checksum table on dbstore1002 - T213670

Mentioned in SAL (#wikimedia-operations) [2019-01-26T03:23:54Z] <marostegui> Convert all tables on incubatorwiki to innodb to fix s3 thread - T213670

The s3 thread broke with:

             Last_SQL_Error: Error 'Got error 22 "Invalid argument" from storage engine TokuDB' on query. Default database: 'incubatorwiki'. Query: 'INSERT /* ActorMigration::getInsertValuesWithTempTable  */ INTO `revision_actor_temp` (revactor_rev,revactor_actor,revactor_timestamp,revactor_page) VALUES xxxxx
Replicate_Ignore_Server_Ids:

I fixed it by converting that table to InnoDB. Then the page table failed too, so I am converting all the tables on incubatorwiki to InnoDB.

Change 486712 had a related patch set uploaded (by Marostegui; owner: Marostegui):
[operations/puppet@production] dbstore.my.cnf.erb: Make InnoDB default

https://gerrit.wikimedia.org/r/486712

All the tables on incubatorwiki are now InnoDB and replication is catching up.

Mentioned in SAL (#wikimedia-operations) [2019-01-27T03:28:25Z] <marostegui> Fix x1 on dbstore1002 - T213670

It broke again on echo_unread_wikis; I fixed it.

Change 486712 merged by Marostegui:
[operations/puppet@production] dbstore.my.cnf.erb: Make InnoDB default

https://gerrit.wikimedia.org/r/486712
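
The merged change presumably boils down to a my.cnf setting along these lines (a sketch; the actual puppet template is not shown here):

# dbstore.my.cnf
default_storage_engine = InnoDB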

And after 4 days of trying to alter mep_word_persistence, dbstore1002 crashed again (T213706#4917915).

Another crash; I had to alter all of the mediawikiwiki database's tables to InnoDB to restart s3 replication. x1 is still broken due to a missing row, but fixing it requires reading from the binlog of the x1 master, and I have never done that :)

You did it at the last all-hands! :-)
I will walk you through it so you can fix it yourself entirely!

You did it at the last all-hands! :-)
I will walk you through it so you can fix it yourself entirely!

Sure, but you showed me the command, and I kind of recall what it looks like, but I don't feel like testing my memory after a 12h flight on a Sunday evening :D

An attempt to run mydumper for T210478 on dbstore1002 made it crash.

Mentioned in SAL (#wikimedia-operations) [2019-02-05T09:11:22Z] <marostegui> Start all slaves on dbstore1002 - T213670

@elukey @jcrespo Any objection to setting dbstore1002 as IDEMPOTENT?
This host crashes every single day, the data already drifts a lot from production, and this host will be depooled in a matter of weeks, so it doesn't make any sense for us to keep spending time fixing it. Especially on x1.
If we feel x1 is absolutely necessary, we should re-import it again (although I am not 100% sure dbstore1002 wouldn't crash during the re-import).

OK by me; the data is already garbage, so more garbage would not be a problem :-)

TIL Idempotent :)

Good for me. Today it was really tedious to go through all those errors, and since we already have the new hosts in place, I am in favor.

Mentioned in SAL (#wikimedia-operations) [2019-02-07T12:42:14Z] <marostegui> Set dbstore1002 as IDEMPOTENT - T213670

Change 488920 had a related patch set uploaded (by Marostegui; owner: Marostegui):
[operations/puppet@production] dbstore.my.cnf: Make the slave IDEMPOTENT

https://gerrit.wikimedia.org/r/488920

The sqoop for the actor and comment tables just finished, and we should be using the new hardware next month, so no problem for me either :)

Change 488920 merged by Marostegui:
[operations/puppet@production] dbstore.my.cnf: Make the slave IDEMPOTENT

https://gerrit.wikimedia.org/r/488920
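
In IDEMPOTENT mode the replication SQL thread suppresses duplicate-key and missing-row errors when applying row-based events, so the corrupted data stops breaking replication (at the cost of letting the drift grow). The merged change presumably boils down to a my.cnf line like this (a sketch; the exact template is not shown here):

# dbstore.my.cnf
slave_exec_mode = IDEMPOTENT

It can also typically be set at runtime with SET GLOBAL slave_exec_mode = 'IDEMPOTENT'; without restarting mysqld.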

Change 489147 had a related patch set uploaded (by Marostegui; owner: Marostegui):
[operations/puppet@production] dbstore.my.cnf: Enable automatic slaves start

https://gerrit.wikimedia.org/r/489147

Change 489147 merged by Marostegui:
[operations/puppet@production] dbstore.my.cnf: Enable automatic slaves start

https://gerrit.wikimedia.org/r/489147
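
Enabling automatic slave start essentially means making sure skip-slave-start is no longer present in the config, so that mysqld starts all configured replication connections on its own after a restart; a sketch of the my.cnf change (exact template lines not shown here):

# dbstore.my.cnf
# remove (or comment out) this option so replication starts automatically at boot:
#skip-slave-start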

Another crash just happened:

InnoDB: We intentionally crash the server, because it appears to be hung.
2019-02-10 09:54:08 7fa2cfffe700  InnoDB: Assertion failure in thread 140337251084032 in file srv0srv.cc line 2200
InnoDB: We intentionally generate a memory trap.
InnoDB: Submit a detailed bug report to http://bugs.mysql.com.
InnoDB: If you get repeated assertion failures or crashes, even
InnoDB: immediately after the mysqld startup, there may be
InnoDB: corruption in the InnoDB tablespace. Please refer to
InnoDB: http://dev.mysql.com/doc/refman/5.6/en/forcing-innodb-recovery.html
InnoDB: about forcing recovery.
190210  9:54:08 [ERROR] mysqld got signal 6 ;
This could be because you hit a bug. It is also possible that this binary
or one of the libraries it was linked against is corrupt, improperly built,
or misconfigured. This error can also be caused by malfunctioning hardware.

To report this bug, see http://kb.askmonty.org/en/reporting-bugs

We will try our best to scrape up some info that will hopefully help
diagnose the problem, but since we have already crashed,
something is definitely wrong and this may fail.

Server version: 10.0.22-MariaDB
key_buffer_size=134217728
read_buffer_size=131072
max_used_connections=53
max_threads=252
thread_count=27
It is possible that mysqld could use up to
key_buffer_size + (read_buffer_size + sort_buffer_size)*max_threads = 684383 K  bytes of memory
Hope that's ok; if not, decrease some variables in the equation.

Thread pointer: 0x0x0
Attempting backtrace. You can use the following information to find out
where mysqld died. If you see no messages after this, something went
terribly wrong...
stack_bottom = 0x0 thread_stack 0x48000
mysys/stacktrace.c:247(my_print_stacktrace)[0xbdd6ee]
sql/signal_handler.cc:153(handle_fatal_signal)[0x73dc40]
/lib/x86_64-linux-gnu/libpthread.so.0(+0x10330)[0x7fa7dcd09330]
/lib/x86_64-linux-gnu/libc.so.6(gsignal+0x37)[0x7fa7dbb1dc37]
/lib/x86_64-linux-gnu/libc.so.6(abort+0x148)[0x7fa7dbb21028]
srv/srv0srv.cc:2200(srv_error_monitor_thread)[0x9870aa]
/lib/x86_64-linux-gnu/libpthread.so.0(+0x8184)[0x7fa7dcd01184]
/lib/x86_64-linux-gnu/libc.so.6(clone+0x6d)[0x7fa7dbbe503d]
The manual page at http://dev.mysql.com/doc/mysql/en/crashing.html contains
information that should help you find out what is causing the crash.
190210 09:54:21 mysqld_safe Number of processes running now: 0
190210 09:54:21 mysqld_safe mysqld restarted

MySQL is now doing crash recovery.

Change 489450 had a related patch set uploaded (by Marostegui; owner: Marostegui):
[operations/puppet@production] dbstore.my.cnf: Remove the second skip-slave-start

https://gerrit.wikimedia.org/r/489450

Change 489450 merged by Marostegui:
[operations/puppet@production] dbstore.my.cnf: Remove the second skip-slave-start

https://gerrit.wikimedia.org/r/489450

For what it's worth, dbstore1002 is now lagging 7 days behind on s8 (wikidatawiki) and keeps lagging; I doubt it will ever catch up.
Yesterday the migration of the staging database to dbstore1003-1005 happened (T210478#4963411), so everyone should start using those hosts as soon as possible, especially after seeing so many crashes, lag that will never recover, and corrupted data (due to the above crashes).

Marostegui lowered the priority of this task from High to Low. Feb 19 2019, 10:39 AM

Reducing priority, as the errors on dbstore1002 are not very important anymore: this host shouldn't be used any longer, and everything using it should migrate to the new hosts (T210478: Migrate dbstore1002 to a multi instance setup on dbstore100[3-5]).

MySQL will be stopped on the 4th of March as the final part of the deprecation of this host.
It has been read-only since the 18th of February anyway. The data should not be trusted anymore, as it is badly corrupted as a result of all the crashes the host has had.
I am closing this, as this host will no longer be supported: if MySQL crashes, it will get restarted and replication will be started automatically in idempotent mode (T213670#4934489).

mysql crashed last night:

Thread pointer: 0x0x0
Attempting backtrace. You can use the following information to find out
where mysqld died. If you see no messages after this, something went
terribly wrong...
stack_bottom = 0x0 thread_stack 0x48000
mysys/stacktrace.c:247(my_print_stacktrace)[0xbdd6ee]
sql/signal_handler.cc:153(handle_fatal_signal)[0x73dc40]
/lib/x86_64-linux-gnu/libpthread.so.0(+0x10330)[0x7f77a198a330]
/lib/x86_64-linux-gnu/libc.so.6(gsignal+0x37)[0x7f77a079ec37]
/lib/x86_64-linux-gnu/libc.so.6(abort+0x148)[0x7f77a07a2028]
srv/srv0srv.cc:2200(srv_error_monitor_thread)[0x9870aa]
/lib/x86_64-linux-gnu/libpthread.so.0(+0x8184)[0x7f77a1982184]
/lib/x86_64-linux-gnu/libc.so.6(clone+0x6d)[0x7f77a086603d]
The manual page at http://dev.mysql.com/doc/mysql/en/crashing.html contains
information that should help you find out what is causing the crash.
190224 03:20:23 mysqld_safe Number of processes running now: 0
190224 03:20:23 mysqld_safe mysqld restarted

MySQL just crashed again:

Thread pointer: 0x0x0
Attempting backtrace. You can use the following information to find out
where mysqld died. If you see no messages after this, something went
terribly wrong...
stack_bottom = 0x0 thread_stack 0x48000
mysys/stacktrace.c:247(my_print_stacktrace)[0xbdd6ee]
sql/signal_handler.cc:153(handle_fatal_signal)[0x73dc40]
/lib/x86_64-linux-gnu/libpthread.so.0(+0x10330)[0x7f1735326330]
/lib/x86_64-linux-gnu/libc.so.6(gsignal+0x37)[0x7f173413ac37]
/lib/x86_64-linux-gnu/libc.so.6(abort+0x148)[0x7f173413e028]
srv/srv0srv.cc:2200(srv_error_monitor_thread)[0x9870aa]
/lib/x86_64-linux-gnu/libpthread.so.0(+0x8184)[0x7f173531e184]
/lib/x86_64-linux-gnu/libc.so.6(clone+0x6d)[0x7f173420203d]
The manual page at http://dev.mysql.com/doc/mysql/en/crashing.html contains
information that should help you find out what is causing the crash.
190227 06:22:44 mysqld_safe Number of processes running now: 0
190227 06:22:44 mysqld_safe mysqld restarted

dbstore1002 just crashed:

Thread pointer: 0x0x0
Attempting backtrace. You can use the following information to find out
where mysqld died. If you see no messages after this, something went
terribly wrong...
stack_bottom = 0x0 thread_stack 0x48000
mysys/stacktrace.c:247(my_print_stacktrace)[0xbdd6ee]
sql/signal_handler.cc:153(handle_fatal_signal)[0x73dc40]
/lib/x86_64-linux-gnu/libpthread.so.0(+0x10330)[0x7fc05e319330]
/lib/x86_64-linux-gnu/libc.so.6(gsignal+0x37)[0x7fc05d12dc37]
/lib/x86_64-linux-gnu/libc.so.6(abort+0x148)[0x7fc05d131028]
srv/srv0srv.cc:2200(srv_error_monitor_thread)[0x9870aa]
/lib/x86_64-linux-gnu/libpthread.so.0(+0x8184)[0x7fc05e311184]
/lib/x86_64-linux-gnu/libc.so.6(clone+0x6d)[0x7fc05d1f503d]
The manual page at http://dev.mysql.com/doc/mysql/en/crashing.html contains
information that should help you find out what is causing the crash.
190301 08:59:09 mysqld_safe Number of processes running now: 0
190301 08:59:09 mysqld_safe mysqld restarted
190301  8:59:09 [Note] /opt/wmf-mariadb10/bin/mysqld (mysqld 10.0.22-MariaDB) starting as process 24694 ...