db2060 crashed (RAID controller)
Closed, Resolved · Public

jcrespo created this task.Dec 23 2016, 4:41 PM
Restricted Application added subscribers: Southparkfan, Aklapper. Dec 23 2016, 4:41 PM
RobH added a subscriber: RobH.Dec 23 2016, 4:49 PM

When I first connected to the serial console, it wasn't accepting input, but scrolled the following:

[27858755.642012] INFO: task jbd2/sda1-8:385 blocked for more than 120 seconds.
[27858755.677137] Tainted: G W 3.19.0-2-amd64 #1
[27858755.706546] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[27858755.747062] INFO: task systemd-journal:435 blocked for more than 120 seconds.
[27858755.783791] Tainted: G W 3.19.0-2-amd64 #1
[27858755.813147] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[27858755.853801] INFO: task xfsaild/dm-0:617 blocked for more than 120 seconds.
[27858755.889060] Tainted: G W 3.19.0-2-amd64 #1
[27858755.918578] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[27858755.958503] INFO: task mysqld:2964 blocked for more than 120 seconds.
[27858755.991749] Tainted: G W 3.19.0-2-amd64 #1
[27858756.021192] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[27858756.061603] INFO: task mysqld:2987 blocked for more than 120 seconds.
[27858756.095192] Tainted: G W 3.19.0-2-amd64 #1
[27858756.124653] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[27858756.165498] INFO: task mysqld:60514 blocked for more than 120 seconds.
[27858756.198438] Tainted: G W 3.19.0-2-amd64 #1
[27858756.227828] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[27858756.268423] INFO: task rs:main Q:Reg:41811 blocked for more than 120 seconds.
[27858756.304953] Tainted: G W 3.19.0-2-amd64 #1
[27858756.334131] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[27858756.373961] INFO: task kworker/u129:2:39962 blocked for more than 120 seconds.
[27858756.411161] Tainted: G W 3.19.0-2-amd64 #1
[27858756.440120] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[27858756.480110] INFO: task nrpe:47870 blocked for more than 120 seconds.
[27858756.512814] Tainted: G W 3.19.0-2-amd64 #1
[27858756.542579] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[27858756.582336] INFO: task nrpe:47872 blocked for more than 120 seconds.
[27858756.615410] Tainted: G W 3.19.0-2-amd64 #1
[27858756.644917] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[27858757.129438] INFO: rcu_sched detected stalls on CPUs/tasks: { 0} (detected by 7, t=36768 jiffies, g=237869017, c=237869016, q=391)
[27858820.221410] INFO: rcu_sched detected stalls on CPUs/tasks: { 0} (detected by 15, t=52524 jiffies, g=237869017, c=237869016, q=2300)
[27858883.313372] INFO: rcu_sched detected stalls on CPUs/tasks: { 0} (detected by 1, t=68283 jiffies, g=237869017, c=237869016, q=2460)

Then I power-cycled the system, and it came back online (with some errors in the OS boot output).
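
For anyone triaging a similar console dump, here is a minimal sketch that pulls the blocked task names and PIDs out of the hung-task lines above. The regular expression and the `blocked_tasks` helper are illustrative assumptions based on the message format seen in this log, not part of any existing tool.

```python
import re
from collections import Counter

# Matches kernel hung-task reports such as:
# [27858755.958503] INFO: task mysqld:2964 blocked for more than 120 seconds.
# The task name may itself contain colons (e.g. "rs:main Q:Reg"), so the
# greedy name group backtracks to the last ":<digits>" before " blocked".
HUNG_TASK = re.compile(
    r"^\[\s*(?P<ts>\d+\.\d+)\] INFO: task (?P<name>.+):(?P<pid>\d+) "
    r"blocked for more than (?P<secs>\d+) seconds\.$"
)


def blocked_tasks(lines):
    """Return a (name, pid) tuple for every hung-task report in `lines`."""
    found = []
    for line in lines:
        match = HUNG_TASK.match(line.strip())
        if match:
            found.append((match.group("name"), int(match.group("pid"))))
    return found


# A few lines copied from the console output above.
sample = [
    "[27858755.642012] INFO: task jbd2/sda1-8:385 blocked for more than 120 seconds.",
    "[27858755.677137] Tainted: G W 3.19.0-2-amd64 #1",
    "[27858755.958503] INFO: task mysqld:2964 blocked for more than 120 seconds.",
    "[27858756.061603] INFO: task mysqld:2987 blocked for more than 120 seconds.",
    "[27858756.268423] INFO: task rs:main Q:Reg:41811 blocked for more than 120 seconds.",
    "[27858756.373961] INFO: task kworker/u129:2:39962 blocked for more than 120 seconds.",
]

tasks = blocked_tasks(sample)
counts = Counter(name for name, _ in tasks)
for name, n in counts.most_common():
    print(f"{name}: {n} blocked thread(s)")
```

Seeing the journal thread (jbd2), the XFS log flusher (xfsaild) and mysqld all blocked at once points at the storage layer rather than at any single process, which fits a controller lockup.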

RobH added a comment.Dec 23 2016, 4:51 PM

During POST, the following also scrolled past:

1719-Slot 0 Drive Array - A controller failure event occurred prior to this power-up. (Previous lock up code = 0x13)

From the OS logs:

Dec 23 16:36:10 db2060 kernel: [    6.793108] ata2.01: failed to resume link (SControl 0)
Dec 23 16:36:10 db2060 kernel: [    6.829120] ata2.00: SATA link down (SStatus 0 SControl 300)
Dec 23 16:36:10 db2060 kernel: [    6.856280] ata2.01: SATA link down (SStatus 4 SControl 0)
Dec 23 16:36:10 db2060 kernel: [    6.882945] ata1.01: failed to resume link (SControl 0)
Dec 23 16:36:10 db2060 kernel: [    6.918873] ata1.00: SATA link down (SStatus 0 SControl 300)
Dec 23 16:36:10 db2060 kernel: [    6.946696] ata1.01: SATA link down (SStatus 4 SControl 0)
HP ProLiant System ROM	08/02/2014
HP ProLiant System ROM - Backup	08/02/2014
HP ProLiant System ROM Bootblock	03/05/2013
HP Smart Array P420i Controller	6.00
iLO	2.03 Nov 07 2014
Power Management Controller Firmware	3.3
Power Management Controller Firmware Bootloader	2.7
SAS Programmable Logic Device	Version 0x0C
Server Platform Services (SPS) Firmware	2.1.7.E7.4
System Programmable Logic Device	Version 0x32
6	 Caution	POST Message	12/23/2016 16:35	12/23/2016 16:35	1	POST Error: 1792-Slot X Drive Array - Valid Data Found in Cache Module. Data will automatically be written to drive array.
5	 Caution	POST Message	12/23/2016 16:35	12/23/2016 16:35	1	POST Error: 1719-A controller failure event occurred prior to this power-up
4	 Critical	Drive Array	12/23/2016 15:49	12/23/2016 15:49	1	Drive Array Controller Failure (Slot 0)
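
As a rough sketch, the tab-separated event log entries above can be parsed to see how long the controller had been down before the power cycle. The column names (ID, severity, class, two timestamps, count, description) are assumptions inferred from the layout of the pasted rows; the paste itself carries no header.

```python
from datetime import datetime
from typing import NamedTuple


class Event(NamedTuple):
    # Field names are assumed; the pasted log has no header row.
    event_id: int
    severity: str
    event_class: str
    first_seen: datetime
    last_seen: datetime
    count: int
    description: str


def parse_event(line):
    """Parse one tab-separated event log row into an Event."""
    parts = [p.strip() for p in line.split("\t")]
    if len(parts) != 7:
        raise ValueError(f"expected 7 fields, got {len(parts)}")
    fmt = "%m/%d/%Y %H:%M"
    return Event(
        int(parts[0]),
        parts[1],
        parts[2],
        datetime.strptime(parts[3], fmt),
        datetime.strptime(parts[4], fmt),
        int(parts[5]),
        parts[6],
    )


# Two rows copied from the event log above.
rows = [
    "4\t Critical\tDrive Array\t12/23/2016 15:49\t12/23/2016 15:49\t1\t"
    "Drive Array Controller Failure (Slot 0)",
    "5\t Caution\tPOST Message\t12/23/2016 16:35\t12/23/2016 16:35\t1\t"
    "POST Error: 1719-A controller failure event occurred prior to this power-up",
]

events = [parse_event(r) for r in rows]
critical = [e for e in events if e.severity == "Critical"]
gap = events[1].first_seen - events[0].first_seen
print(f"{len(critical)} critical event(s); {gap} between failure and POST")
```

For these two rows the gap between the controller failure entry (15:49) and the POST message (16:35) is 46 minutes.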
jcrespo renamed this task from db2060 crashed, probably RAID controller to db2060 crashed (RAID controller).Dec 23 2016, 5:42 PM
jcrespo triaged this task as Low priority.Dec 23 2016, 5:45 PM
jcrespo added subscribers: Marostegui, Papaul.

Leaving this open for @Marostegui and @Papaul to see. There is not much else to do except maybe upgrading the BIOS, so that outdated firmware can be ruled out the next time this happens. Not sure it is worth it: these errors only seem to happen about once every 6 months across all servers of this type.

jcrespo moved this task from Triage to In progress on the DBA board.Dec 23 2016, 5:45 PM
jcrespo claimed this task.

I haven't checked in much detail, but from the logs it does look like just a controller crash. We can upgrade the BIOS once we have some spare time, now that it is easy to do; that way we get ahead of HP support in case this recurs on this host (hopefully not!).

We should restart the server anyway, so we can take advantage of that and upgrade whatever needs upgrading:

Cache Status Details: The current array controller had valid data stored in its battery/capacitor backed write cache the last time it was reset or was powered up.  This indicates that the system may not have been shut down gracefully.  The array controller has automatically written, or has attempted to write, this data to the drives.  This message will continue to be displayed until the next reset or power-cycle of the array controller.

I have been talking to Papaul and he's kindly agreed to upgrade its BIOS on Thursday, so we will reboot and upgrade it.

Change 330376 had a related patch set uploaded (by Marostegui):
db-codfw.php: Depool db2060

https://gerrit.wikimedia.org/r/330376

Change 330376 merged by jenkins-bot:
db-codfw.php: Depool db2060

https://gerrit.wikimedia.org/r/330376

Mentioned in SAL (#wikimedia-operations) [2017-01-04T10:12:18Z] <marostegui@tin> Synchronized wmf-config/db-codfw.php: Depool db2060 - T154031 (duration: 00m 47s)

@Papaul ping me today once you are around and have time so we can do all the updates and get this ticket over with
Thanks!

Mentioned in SAL (#wikimedia-operations) [2017-01-17T16:22:48Z] <marostegui> Powering off db2060 for maintenance - T154031

Firmware update complete.

After the reboot, the cache looks good now:

Cache Status: OK

Going to repool the server for now, as it has been stable for the past few weeks.

Mentioned in SAL (#wikimedia-operations) [2017-01-18T08:33:16Z] <marostegui@tin> Synchronized wmf-config/db-codfw.php: Repool db2060 - T154031 (duration: 00m 40s)

Marostegui closed this task as Resolved.Jan 18 2017, 8:34 AM
Marostegui claimed this task.