
db2125 crashed
Closed, ResolvedPublic

Description

On Saturday, November 23rd (times in UTC):

08:28:25 <+icinga-wm> PROBLEM - SSH on db2125 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring
08:31:39 <+icinga-wm> RECOVERY - SSH on db2125 is OK: SSH OK - OpenSSH_7.4p1 Debian-10+deb9u6 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring
08:38:41 <+icinga-wm> PROBLEM - SSH on db2125 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring

Event Timeline

Marostegui added projects: DBA, SRE.

Mentioned in SAL (#wikimedia-operations) [2019-11-25T05:58:14Z] <marostegui@cumin1001> dbctl commit (dc=all): 'Depool db2125 - crashed T239042', diff saved to https://phabricator.wikimedia.org/P9726 and previous config saved to /var/cache/conftool/dbconfig/20191125-055813-marostegui.json

There are no hardware logs:

/admin1-> racadm getsel
Record:      1
Date/Time:   07/12/2019 21:38:11
Source:      system
Severity:    Ok
Description: Log cleared.
-------------------------------------------------------------------------------

The first traces of the crash are:

Nov 23 08:25:35 db2125 mysqld[13682]: InnoDB: Warning: a long semaphore wait:
Nov 23 08:25:35 db2125 mysqld[13682]: --Thread 139387736135424 has waited at row0purge.cc line 772 for 241.00 seconds the semaphore:
Nov 23 08:25:35 db2125 mysqld[13682]: S-lock on RW-latch at 0x558a1450ad40 '&dict_operation_lock'
Nov 23 08:25:35 db2125 mysqld[13682]: a writer (thread id 139387752920832) has reserved it in mode  exclusive
Nov 23 08:25:35 db2125 mysqld[13682]: number of readers 0, waiters flag 1, lock_word: 0
Nov 23 08:25:35 db2125 mysqld[13682]: Last time read locked in file row0purge.cc line 772
Nov 23 08:25:35 db2125 mysqld[13682]: Last time write locked in file dict0stats.cc line 2532
Nov 23 08:25:35 db2125 mysqld[13682]: Holder thread 0 file not yet reserved line 0
Nov 23 08:25:35 db2125 mysqld[13682]: InnoDB: Warning: semaphore wait:
Nov 23 08:25:35 db2125 mysqld[13682]: --Thread 139387736135424 has waited at row0purge.cc line 772 for 241.00 seconds the semaphore:
Nov 23 08:25:35 db2125 mysqld[13682]: S-lock on RW-latch at 0x558a1450ad40 '&dict_operation_lock'
Nov 23 08:25:35 db2125 mysqld[13682]: a writer (thread id 139387752920832) has reserved it in mode  exclusive
Nov 23 08:25:35 db2125 mysqld[13682]: number of readers 0, waiters flag 1, lock_word: 0
Nov 23 08:25:35 db2125 mysqld[13682]: Last time read locked in file row0purge.cc line 772
Nov 23 08:25:35 db2125 mysqld[13682]: Last time write locked in file dict0stats.cc line 2532
Nov 23 08:25:35 db2125 mysqld[13682]: Holder thread 0 file not yet reserved line 0
Nov 23 08:25:35 db2125 mysqld[13682]: InnoDB: Warning: semaphore wait:

So it might be related to a storage crash.
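These repeated warnings can also be mined mechanically. InnoDB intentionally aborts the server once a semaphore wait exceeds roughly 600 seconds, so waits in the 200+ second range are already a strong crash precursor. A minimal sketch (the helper, regex, and threshold are illustrative assumptions, not part of any WMF tooling):

```python
import re

# Hypothetical helper: scan syslog lines for InnoDB semaphore-wait
# warnings and flag waits approaching the ~600-second watchdog after
# which InnoDB intentionally crashes the server.
WAIT_RE = re.compile(
    r"--Thread (?P<thread>\d+) has waited at (?P<file>\S+) line (?P<line>\d+) "
    r"for (?P<secs>[\d.]+) seconds the semaphore"
)

def long_waits(lines, threshold=240.0):
    """Return (thread, file, seconds) for waits at or above threshold."""
    hits = []
    for line in lines:
        m = WAIT_RE.search(line)
        if m and float(m.group("secs")) >= threshold:
            hits.append((m.group("thread"), m.group("file"),
                         float(m.group("secs"))))
    return hits

log = [
    "Nov 23 08:25:35 db2125 mysqld[13682]: --Thread 139387736135424 "
    "has waited at row0purge.cc line 772 for 241.00 seconds the semaphore:",
]
print(long_waits(log))  # -> [('139387736135424', 'row0purge.cc', 241.0)]
```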

More logs from the console:

[10086760.709402] NMI watchdog: BUG: soft lockup - CPU#14 stuck for 22s! [kworker/u480:0:6]
[10086764.636175] NMI watchdog: BUG: soft lockup - CPU#4 stuck for 22s! [sshd:107965]
[10086764.688053] NMI watchdog: BUG: soft lockup - CPU#12 stuck for 22s! [sshd:107788]
[10086768.662715] NMI watchdog: BUG: soft lockup - CPU#10 stuck for 22s! [sshd:107781]

Mentioned in SAL (#wikimedia-operations) [2019-11-25T06:18:35Z] <marostegui> racadm serveraction hardreset on db2125 T239042

Nothing apart from this in the OS logs:

Nov 23 08:17:09 db2125 systemd[1]: Started Time & Date Service.
Nov 23 08:18:01 db2125 CRON[107127]: (prometheus) CMD (/usr/local/bin/prometheus-puppet-agent-stats --outfile /var/lib/prometheus/node.d/puppet_agent.prom)
Nov 23 08:19:01 db2125 CRON[107326]: (prometheus) CMD (/usr/local/bin/prometheus-puppet-agent-stats --outfile /var/lib/prometheus/node.d/puppet_agent.prom)
Nov 23 08:20:01 db2125 CRON[107475]: (prometheus) CMD (/usr/local/bin/prometheus-puppet-agent-stats --outfile /var/lib/prometheus/node.d/puppet_agent.prom)
^@^@^@^@ [... long run of NUL bytes where the log ends, consistent with the file being truncated mid-write when the host went down ...]

I have extracted the controller logs; nothing is showing up there either.

Marostegui added a project: ops-codfw.
Marostegui moved this task from Triage to In progress on the DBA board.
Marostegui added a subscriber: Papaul.

@Papaul can we upgrade firmware and BIOS on this host? It is a very new host, and if this crash happens again we might need to contact Dell.

Change 552695 had a related patch set uploaded (by Marostegui; owner: Marostegui):
[operations/puppet@production] db2125: Disable notifications

https://gerrit.wikimedia.org/r/552695

Change 552695 merged by Marostegui:
[operations/puppet@production] db2125: Disable notifications

https://gerrit.wikimedia.org/r/552695

For what it's worth, this is the kernel this host is running at the moment (I have not upgraded it since the crash):

root@db2125:~# uname -a
Linux db2125 4.9.0-11-amd64 #1 SMP Debian 4.9.189-3+deb9u1 (2019-09-20) x86_64 GNU/Linux

Upgrade complete:
Before
BIOS Version 2.2.11
iDRAC Firmware Version 3.34.34.34

After
BIOS Version 2.4.7
iDRAC Firmware Version 3.36.36.36
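A small aside on verifying such upgrades programmatically: firmware/BIOS version strings compare correctly as integer tuples, not as plain strings (lexical comparison breaks once a component reaches two digits, e.g. "2.9" vs "2.10"). A hypothetical check using the versions above:

```python
# Hypothetical check: compare dotted version strings as integer tuples.
def ver(s):
    return tuple(int(p) for p in s.split("."))

assert ver("2.4.7") > ver("2.2.11")           # BIOS upgraded
assert ver("3.36.36.36") > ver("3.34.34.34")  # iDRAC upgraded
assert ver("2.10.0") > ver("2.9.9")           # where lexical order fails
print("upgrade confirmed")
```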

Thank you Papaul. I will start MySQL and do a data consistency check.

Mentioned in SAL (#wikimedia-operations) [2019-11-25T17:36:21Z] <marostegui> Upgrade kernel on db2125 T239042

Kernel upgraded and host rebooted:

root@db2125:~# uname -a
Linux db2125 4.9.0-11-amd64 #1 SMP Debian 4.9.189-3+deb9u2 (2019-11-11) x86_64 GNU/Linux

Mentioned in SAL (#wikimedia-operations) [2019-11-26T06:51:17Z] <marostegui> Run compare.py for db2125 - T239042
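The general idea behind a consistency check like this is to hash fixed-size chunks of ordered rows on both hosts and compare digests, so any drift is narrowed to one chunk instead of diffing every row. A sketch of that technique (illustrative only, not the actual compare.py tool):

```python
import hashlib

# Sketch: per-chunk digests of ordered rows; differing digests localize
# a data drift to a single chunk.
def chunk_digests(rows, chunk_size=1000):
    digests = []
    for i in range(0, len(rows), chunk_size):
        h = hashlib.sha256()
        for row in rows[i:i + chunk_size]:
            h.update(repr(row).encode())
        digests.append(h.hexdigest())
    return digests

def drifted_chunks(rows_a, rows_b, chunk_size=1000):
    """Indices of chunks whose digests differ between the two hosts."""
    da = chunk_digests(rows_a, chunk_size)
    db = chunk_digests(rows_b, chunk_size)
    return [i for i, (x, y) in enumerate(zip(da, db)) if x != y]

master = [(i, f"row{i}") for i in range(3000)]
replica = list(master)
replica[1500] = (1500, "corrupted")
print(drifted_chunks(master, replica))  # -> [1]
```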

Change 552971 had a related patch set uploaded (by Marostegui; owner: Marostegui):
[operations/puppet@production] db2125: Enable notifications

https://gerrit.wikimedia.org/r/552971

Change 552971 merged by Marostegui:
[operations/puppet@production] db2125: Enable notifications

https://gerrit.wikimedia.org/r/552971

Mentioned in SAL (#wikimedia-operations) [2019-11-27T05:48:10Z] <marostegui@cumin1001> dbctl commit (dc=all): 'Repool db2125 T239042', diff saved to https://phabricator.wikimedia.org/P9759 and previous config saved to /var/cache/conftool/dbconfig/20191127-054809-marostegui.json

All the wikis have had their main tables checked and there are no apparent data drifts, so I am going to repool this host and consider this fixed for now.
If it happens again, we can reopen this.