db2078 m1 mysqld process crashed
Closed, Resolved · Public

Description

On the 26th at around 2 AM, db2078 started to suffer contention errors, and the MySQL process eventually crashed:

Dec 26 02:04:08 db2078 mysqld[2493]: --Thread 140072766662400 has waited at srv0srv.cc line 2016 for 241.00 seconds the semaphore:
Dec 26 02:04:08 db2078 mysqld[2493]: X-lock on RW-latch at 0x557269cd8230 created in file dict0dict.cc line 833
Dec 26 02:04:08 db2078 mysqld[2493]: a writer (thread id 140072845604608) has reserved it in mode  exclusive
Dec 26 02:04:08 db2078 mysqld[2493]: number of readers 0, waiters flag 1, lock_word: 0
Dec 26 02:04:08 db2078 mysqld[2493]: Last time write locked in file row0mysql.cc line 3342
Dec 26 02:04:08 db2078 mysqld[2493]: 2020-12-26  2:04:08 0 [Note] InnoDB: A semaphore wait:
Dec 26 02:04:08 db2078 mysqld[2493]: --Thread 140070106650368 has waited at buf0buf.cc line 6229 for 0.00 seconds the semaphore:
Dec 26 02:04:08 db2078 mysqld[2493]: Mutex at 0x55726c3e88a0, Mutex BUF_POOL created buf0buf.cc:1877, lock var 2
Dec 26 02:04:08 db2078 mysqld[2493]: 2020-12-26  2:04:08 0 [Note] InnoDB: A semaphore wait:
Dec 26 02:04:08 db2078 mysqld[2493]: --Thread 140072775055104 has waited at dict0stats_bg.cc line 373 for 234.00 seconds the semaphore:
Dec 26 02:04:08 db2078 mysqld[2493]: Mutex at 0x557269cd8200, Mutex DICT_SYS created dict0dict.cc:824, lock var 2
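
For context: InnoDB logs these semaphore-wait diagnostics when internal latches (here the dictionary RW-latch plus the BUF_POOL and DICT_SYS mutexes) stay held for a long time. As a minimal sketch, this is one way to inspect the same state on a live server, assuming shell access and a local MariaDB client; the sed range and the multi-instance socket path are illustrative:

# Dump the SEMAPHORES section of the InnoDB monitor output:
sudo mysql -e "SHOW ENGINE INNODB STATUS\G" | sed -n '/SEMAPHORES/,/TRANSACTIONS/p'
# MariaDB exposes the watchdog that triggered the crash below (600s default);
# on a multi-instance host like this one, point the client at the m1 socket,
# e.g. -S /run/mysqld/mysqld.m1.sock (the socket path is an assumption):
sudo mysql -e "SHOW GLOBAL VARIABLES LIKE 'innodb_fatal_semaphore_wait_threshold';"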

It finally crashed:

Dec 26 02:16:00 db2078 mysqld[2493]: 2020-12-26  2:16:00 0 [ERROR] [FATAL] InnoDB: Semaphore wait has lasted > 600 seconds. We intentionally crash the server because it appears to be hung.
Dec 26 02:16:00 db2078 mysqld[2493]: 201226  2:16:00 [ERROR] mysqld got signal 6 ;
[Sat Dec 26 02:15:27 2020] mysqld[2535]: segfault at 0 ip 0000557269553909 sp 00007f6541473310 error 6 in mysqld[557268d14000+8e4000]
[Sat Dec 26 02:15:27 2020] Code: 00 00 41 c7 45 00 00 00 00 00 48 8b 75 c0 4c 89 e2 8b 7d cc e8 58 1b 7c ff 49 89 c7 49 39 c4 74 5f e8 ab 18 00 00 41 8b 4d 00 <89> 08 85 c9 74 19 49 83 ff ff 0f 84 c7 01 00 00 4d 85 ff 75 24 83
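
The first lines above come from the mysqld error log (via syslog/journald), while the [Sat Dec 26 ...] lines are the kernel's segfault record. A hedged sketch of pulling the crash window for later analysis, assuming a systemd host; the unit name and time bounds are assumptions:

# mysqld messages around the crash (the unit name may differ per host):
sudo journalctl -u mariadb --since "2020-12-26 02:00" --until "2020-12-26 02:20"
# Kernel-side segfault record:
sudo dmesg -T | grep mysqld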

It looks like the errors had been going on for about an hour before the crash.

Event Timeline

Restricted Application added a subscriber: Aklapper. · Mon, Dec 28, 3:49 PM

I have started replication there for now, as some basic checks didn't reveal any corruption. Replication is catching up.
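
For reference, a minimal sketch of such basic checks, assuming a local MariaDB client; the task does not state the exact check set used, so this is illustrative:

# Replication health and lag:
sudo mysql -e "SHOW SLAVE STATUS\G" | grep -E 'Slave_(IO|SQL)_Running|Seconds_Behind_Master|Errno'
# Lightweight consistency pass over the instance (read-heavy; run off-peak):
sudo mysqlcheck --all-databases --check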

@jcrespo, as this is the codfw host we use for backups, maybe we should think about rebuilding it from the master to be fully sure everything is fine? (just m1)

Marostegui triaged this task as Medium priority. · Mon, Jan 4, 6:32 AM
Marostegui moved this task from Triage to In progress on the DBA board.
jcrespo claimed this task. · Mon, Jan 11, 9:11 AM

Mentioned in SAL (#wikimedia-operations) [2021-01-12T17:09:24Z] <jynus> shutting down db2132, db2078:m1 for m1 codfw replica reprovisioning T270877

db2078 has been reprovisioned, but some finishing steps and checks may be needed tomorrow to confirm things are working as expected and that db2078 is back in regular production service.
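
A hedged outline of what such a reprovision typically involves; the backup directory, the placeholder master coordinates, and the use of mariabackup are assumptions, and the actual WMF tooling may differ:

# On the m1 master: take and prepare a consistent physical copy:
mariabackup --backup --target-dir=/srv/backup.m1
mariabackup --prepare --target-dir=/srv/backup.m1
# On db2078: keep the old datadir aside, restore, and re-point replication
# (fix ownership with chown -R mysql:mysql and restart the instance after):
mv /srv/sqldata.m1 /srv/sqldata.m1.bak
mariabackup --copy-back --target-dir=/srv/backup.m1
mysql -e "CHANGE MASTER TO MASTER_HOST='<m1-master>', MASTER_LOG_FILE='<binlog>', MASTER_LOG_POS=<pos>; START SLAVE;"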

jcrespo closed this task as Resolved. · Wed, Jan 13, 12:14 PM
jcrespo reassigned this task from jcrespo to Marostegui.

Instance m1 on db2078 has been reloaded with data from its master, and the grants dump has been reloaded as well. The old version is still available at /srv/sqldata.m1.bak. Replication seems to be flowing normally. Will reopen if Monday's backups fail for some reason. Assigning to Manuel to recognize that he was the first person to respond to the actual crash (this was a team effort :-)).
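
A small sketch of the post-reload verification implied here, assuming a local client; listing accounts as a grants spot check is illustrative, not the actual procedure used:

# Confirm both replication threads run and the replica is caught up:
sudo mysql -e "SHOW SLAVE STATUS\G" | grep -E 'Slave_(IO|SQL)_Running|Seconds_Behind_Master'
# Spot-check that the reloaded grants are present:
sudo mysql -e "SELECT User, Host FROM mysql.user ORDER BY User, Host;"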