
dbstore1001 mysql crashed with: semaphore wait has lasted > 600 seconds
Closed, Resolved · Public

Description

Just saw it crashed a couple of hours ago - creating this for tracking and to investigate

Event Timeline

Restricted Application added a subscriber: Aklapper. · Jun 28 2017, 10:11 AM

There are a bunch of:

InnoDB: Warning: a long semaphore wait:
--Thread 139834895070976 has waited at dict0stats.cc line 2406 for 241.00 seconds the semaphore:
X-lock (wait_ex) on RW-latch at 0x144b040 '&dict_operation_lock'
a writer (thread id 139834895070976) has reserved it in mode  wait exclusive
number of readers 1, waiters flag 1, lock_word: ffffffffffffffff
Last time read locked in file row0undo.cc line 299
Last time write locked in file /home/jynus/mariadb-10.0.29/storage/xtradb/dict/dict0stats.cc line 2406

Then the server keeps replicating (although s3 and s6 complain about duplicate keys).

Then an hour later the issues start again:

170628  7:56:21 [Note] Event Scheduler: [root@localhost].[ops.wmf_dbstore_delay] event execution failed.
InnoDB: Warning: a long semaphore wait:

And at 8:08 it finally crashes:

InnoDB: Error: semaphore wait has lasted > 600 seconds
InnoDB: We intentionally crash the server, because it appears to be hung.
2017-06-28 08:08:15 7f2ddc3fd700  InnoDB: Assertion failure in thread 139834945427200 in file srv0srv.cc line 2222
InnoDB: We intentionally generate a memory trap.
InnoDB: Submit a detailed bug report to http://bugs.mysql.com.
InnoDB: If you get repeated assertion failures or crashes, even
InnoDB: immediately after the mysqld startup, there may be
InnoDB: corruption in the InnoDB tablespace. Please refer to
InnoDB: http://dev.mysql.com/doc/refman/5.6/en/forcing-innodb-recovery.html
InnoDB: about forcing recovery.
170628  8:08:15 [ERROR] mysqld got signal 6 ;
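For reference, the stall can be inspected while the server is still responding; the waiting threads quoted in the error log above also appear in the SEMAPHORES section of the InnoDB status output. A minimal diagnostic sketch, using only standard MySQL/MariaDB statements:

-- The SEMAPHORES section lists the same "long semaphore wait" threads as the
-- error log warnings above, before the 600-second watchdog kills the server.
SHOW ENGINE INNODB STATUS\G

-- Blocked or long-running threads usually show up in the process list as well.
SHOW FULL PROCESSLIST;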
Marostegui renamed this task from "dbstore1001 mysql crashed" to "dbstore1001 mysql crashed with: semaphore wait has lasted > 600 seconds". Jun 28 2017, 10:41 AM
Marostegui added a comment (edited). Jun 28 2017, 10:50 AM

Current status of replication:
s6 is broken with duplicate key on jawiki.watchlist: Error 'Duplicate entry '18049674' for key 'PRIMARY'' on query. Default database: 'jawiki'. Query: 'REPLACE /* WatchedItemStore::duplicateEntry */ INTO watchlist

I have dumped that table under /root before we attempt to fix it

x1 is broken with duplicate key on wikishared.echo_unread_wikis: Could not execute Write_rows_v1 event on table wikishared.echo_unread_wikis; Duplicate entry '50877986-enwiki' for key 'echo_unread_wikis_user_wiki', Error_code: 1062; handler error HA_ERR_FOUND_DUPP_KEY;

I have dumped that table under /root before we attempt to fix it

s6 was fixed by stopping/starting slave.

s6 has complained again with the same issue, a duplicate entry on a REPLACE INTO for the same table: watchlist
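A hedged sketch (assuming MariaDB multi-source replication with connection names such as 's6' and 'x1', as used on the dbstore hosts) of checking a broken shard and of the stop/start that fixed s6 the first time:

-- Show the failing statement and position for one connection:
SHOW SLAVE 's6' STATUS\G        -- Last_SQL_Errno, Last_SQL_Error, Exec_Master_Log_Pos

-- What fixed s6 initially: restarting the replication threads for that connection.
STOP SLAVE 's6';
START SLAVE 's6';

-- Skipping a single offending event instead would look roughly like the lines
-- below, but that is risky on a non-GTID slave and was not what was done here:
-- SET @@default_master_connection = 's6';
-- SET GLOBAL sql_slave_skip_counter = 1;
-- START SLAVE 's6';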

yeah, after a crash, on a non-GTID slave, we can consider the host as broken. We should just migrate backups to dbstore2001.

dbstore2001 shards are still catching up from last week's maintenance.
Some of them have already caught up.

We can ignore jawiki.watchlist, let it catch up and reimport it. It is a pretty small table (4G)
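A minimal sketch of confirming that size estimate from information_schema; the figures are rough for InnoDB, but good enough to decide that a dump and reimport is cheap:

SELECT table_schema,
       table_name,
       ROUND((data_length + index_length) / POW(1024, 3), 1) AS approx_gb
FROM information_schema.tables
WHERE table_schema = 'jawiki'
  AND table_name = 'watchlist';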

I do not really have a strong opinion on doing or not doing that; the only thing I am clear about is that we should just build those multi-instance shards and forget about the current hosts.

Totally agree that we have to move on.

Exporting and importing the table didn't fix it, so it is really broken.
I will set the filters, let it catch up and reimport it tomorrow. Hopefully it is the only broken table :-)
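A hedged sketch of what "set the filters" means here, assuming MariaDB's dynamic replicate_* filter variables; the s6 slave threads have to be stopped first, and on a multi-source slave the filter may instead need to be scoped to the s6 connection or set in the server config:

-- Temporarily skip all replication traffic for the broken table:
STOP SLAVE 's6';
SET GLOBAL replicate_wild_ignore_table = 'jawiki.watchlist';
START SLAVE 's6';

-- After the reimport the temporary filter is dropped again and the host's
-- normal filter set restored before restarting replication:
-- STOP SLAVE 's6';
-- SET GLOBAL replicate_wild_ignore_table = '<the host's usual filter list>';
-- START SLAVE 's6';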

Mentioned in SAL (#wikimedia-operations) [2017-06-28T15:52:16Z] <marostegui> Temporary ignore jawiki.watchlist table during replication on dbstore1001 - T169050

Mentioned in SAL (#wikimedia-operations) [2017-06-29T07:18:09Z] <marostegui> Disable event scheduler temporarily on dbstore1001 - T169050

Mentioned in SAL (#wikimedia-operations) [2017-06-29T11:58:21Z] <marostegui> Stop replication on the same position for: dbstore1001 (s6) and db1050 - T169050

Mentioned in SAL (#wikimedia-operations) [2017-06-29T12:18:25Z] <marostegui> Re-enable event scheduler on dbstore1001 - T169050

I have reimported jawiki.watchlist from production into dbstore1001, removed the ignore replication filters, enabled the normal ones, and restarted replication on s6. It is looking good now.
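A hedged sketch of the coordination step from the SAL entries above: stopping db1050 (assumed here to be the production s6 slave the dump was taken from) and dbstore1001's s6 connection at the same binlog position before dumping and reloading the table. The binlog file name and position are hypothetical placeholders:

-- On db1050: note the execution position, then stop.
SHOW SLAVE STATUS\G             -- Relay_Master_Log_File, Exec_Master_Log_Pos
STOP SLAVE;

-- On dbstore1001: let the s6 connection run up to exactly that position.
START SLAVE 's6' UNTIL
    MASTER_LOG_FILE = 'master-bin.001234',   -- hypothetical coordinates
    MASTER_LOG_POS  = 98765432;

-- With both hosts stopped at the same coordinates, jawiki.watchlist can be
-- dumped from db1050, loaded into dbstore1001, the temporary filter removed,
-- and replication restarted on both with START SLAVE.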

Mentioned in SAL (#wikimedia-operations) [2017-06-29T12:38:56Z] <marostegui> Stop replication on dbstore1002 - x1 - T169050

Marostegui moved this task from Triage to In progress on the DBA board. Jun 29 2017, 1:38 PM
Marostegui closed this task as Resolved. Jun 29 2017, 2:06 PM
Marostegui claimed this task.

I have fixed x1 as well - will close this for now as there is not much we can do. At least the two shards that broke are now fixed.

jcrespo reopened this task as Open. Jul 19 2017, 9:29 PM

dbstore1001 is completely broken. It is okay to close this, but declaring "At least the two shards that broke are now fixed" would be a falsehood.

For context, s1 replication broke (and I made it work, but it will not be the only error), meaning it is an unreliable host to take backups from.

The original shards that failed were s6 and x1, which never failed again. So s1 broke almost 3 weeks after the original crash...
I completely agree that this host is broken and other errors might pop up in the future.
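As a hedged monitoring sketch, assuming MariaDB multi-source replication: a quick health pass over every connection is what would catch the next shard that breaks before a backup is taken from it:

-- One status block per replication connection (s1-s7, x1, ...):
SHOW ALL SLAVES STATUS\G
-- For each connection, Slave_IO_Running and Slave_SQL_Running should both be
-- 'Yes' and Last_SQL_Errno should be 0; anything else means that shard is not
-- a safe source for backups.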

Marostegui closed this task as Resolved. Aug 16 2017, 10:28 AM

I am going to close this, as this will only be solved once we have migrated this host to multi-instance.