
db2069 crashed due to RAID controller
Closed, Resolved, Public

Description

The IPMI interface is available; connecting to the console shows some kernel messages:

[15108735.009963] systemd[1]: systemd-journald.service watchdog timeout (limit 1min)!
[15108760.549599] INFO: rcu_sched detected stalls on CPUs/tasks: { 0} (detected by 17, t=5488344 jiffies, g=236130750, c=236130749, q=99360)
[15108823.642644] INFO: rcu_sched detected stalls on CPUs/tasks: { 0} (detected by 11, t=5504100 jiffies, g=236130750, c=236130749, q=99517)
[15108886.735680] INFO: rcu_sched detected stalls on CPUs/tasks: { 0} (detected by 31, t=5519852 jiffies, g=236130750, c=236130749, q=99680)

This looks like the previous incident, when the RAID controller got disconnected on another host.
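
As a rough aid for reading the stall messages above, here is a minimal Python sketch that extracts the timestamps and `t=` jiffy counts and estimates how long CPU 0 had been stalled. It makes two assumptions: that `t=` counts jiffies since the stalled grace period began (my reading of the kernel's stall warning format), and that the tick rate can be estimated from the spacing of the log lines themselves, since the kernel's configured HZ is not stated here. The names `parse_stalls` and `estimate_hz` are illustrative only.

```python
import re

# Kernel lines copied verbatim from the console output in this task.
LOG = """\
[15108760.549599] INFO: rcu_sched detected stalls on CPUs/tasks: { 0} (detected by 17, t=5488344 jiffies, g=236130750, c=236130749, q=99360)
[15108823.642644] INFO: rcu_sched detected stalls on CPUs/tasks: { 0} (detected by 11, t=5504100 jiffies, g=236130750, c=236130749, q=99517)
[15108886.735680] INFO: rcu_sched detected stalls on CPUs/tasks: { 0} (detected by 31, t=5519852 jiffies, g=236130750, c=236130749, q=99680)
"""

STALL_RE = re.compile(
    r"^\[\s*(?P<ts>\d+\.\d+)\] INFO: rcu_sched detected stalls on CPUs/tasks: "
    r"\{(?P<cpus>[^}]*)\} .*?t=(?P<jiffies>\d+) jiffies"
)


def parse_stalls(text):
    """Return (timestamp_seconds, stalled_cpus, jiffies) for each stall line."""
    stalls = []
    for line in text.splitlines():
        m = STALL_RE.match(line)
        if m:
            stalls.append((float(m["ts"]), m["cpus"].split(), int(m["jiffies"])))
    return stalls


def estimate_hz(stalls):
    """Estimate ticks per second from the first and last stall messages."""
    (ts0, _, j0), (ts1, _, j1) = stalls[0], stalls[-1]
    return (j1 - j0) / (ts1 - ts0)


stalls = parse_stalls(LOG)
hz = estimate_hz(stalls)
# Assumption: t= is the age of the stalled grace period, in jiffies.
stall_hours = stalls[-1][2] / hz / 3600

print(f"stalled CPUs: {sorted({c for _, cpus, _ in stalls for c in cpus})}")
print(f"estimated tick rate: {hz:.0f} Hz")
print(f"approximate stall duration: {stall_hours:.1f} h")
```

If those assumptions hold, the estimate suggests the host had been hung for several hours by the time the console was checked, not just a few minutes.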

Event Timeline

jcrespo renamed this task from "db2069 is down" to "db2069 crashed due to RAID controller". Jul 29 2016, 6:57 AM

And I was right:

</system1/log1>hpiLO-> show record5

status=0
status_tag=COMMAND COMPLETED
Fri Jul 29 06:35:23 2016



/system1/log1/record5
  Targets
  Properties
    number=5
    severity=Critical
    date=07/29/2016
    time=00:26
    description=Drive Array Controller Failure (Slot 0)
  Verbs
    cd version exit show set
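
For anyone wanting to check records like this across several hosts, the following is a small Python sketch that pulls the `Properties` block out of the `show record5` output and flags critical entries. It is based only on the output shape pasted above, not on any documented iLO format, and `parse_properties` is a hypothetical helper name.

```python
from datetime import datetime

# Record text copied from the iLO output in this task.
RECORD = """\
/system1/log1/record5
  Targets
  Properties
    number=5
    severity=Critical
    date=07/29/2016
    time=00:26
    description=Drive Array Controller Failure (Slot 0)
  Verbs
    cd version exit show set
"""


def parse_properties(text):
    """Collect key=value pairs that appear under the Properties heading."""
    props = {}
    in_props = False
    for line in text.splitlines():
        stripped = line.strip()
        if stripped == "Properties":
            in_props = True
        elif stripped in ("Targets", "Verbs"):
            in_props = False
        elif in_props and "=" in stripped:
            key, _, value = stripped.partition("=")
            props[key] = value
    return props


props = parse_properties(RECORD)
# Assumption: date is MM/DD/YYYY and time is 24-hour HH:MM, as in the sample.
event_time = datetime.strptime(f"{props['date']} {props['time']}", "%m/%d/%Y %H:%M")

if props["severity"] == "Critical":
    print(f"{event_time:%Y-%m-%d %H:%M} CRITICAL: {props['description']}")
```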

Mentioned in SAL [2016-07-29T06:59:09Z] <jynus> powercycling db2069 T141601

jcrespo claimed this task.

Replication went back to normal, and GTID reliable replication avoided corruption. Resolving, but keeping an eye on the data.