Page MenuHomePhabricator

db2056 RAID controller (temporary) failure
Closed, ResolvedPublic

Description

Unavailable though network, Its virtual console only spits kernel messages:

[14173946.038578] INFO: rcu_sched detected stalls on CPUs/tasks: { 0} (detected by 16, t=11696154 jiffies, g=137737817, c=137737816, q=71189)
[14174009.131464] INFO: rcu_sched detected stalls on CPUs/tasks: { 0} (detected by 11, t=11711910 jiffies, g=137737817, c=137737816, q=71395)
[14174072.224362] INFO: rcu_sched detected stalls on CPUs/tasks: { 0} (detected by 5, t=11727666 jiffies, g=137737817, c=137737816, q=71475)
[14174135.317252] INFO: rcu_sched detected stalls on CPUs/tasks: { 0} (detected by 25, t=11743422 jiffies, g=137737817, c=137737816, q=71555)
[14174198.410140] INFO: rcu_sched detected stalls on CPUs/tasks: { 0} (detected by 11, t=11759182 jiffies, g=137737817, c=137737816, q=71645)
[14174261.503032] INFO: rcu_sched detected stalls on CPUs/tasks: { 0} (detected by 29, t=11774934 jiffies, g=137737817, c=137737816, q=71725)
[14174324.591918] INFO: rcu_sched detected stalls on CPUs/tasks: { 0} (detected by 15, t=11790690 jiffies, g=137737817, c=137737816, q=71805)
[14174387.688811] INFO: rcu_sched detected stalls on CPUs/tasks: { 0} (detected by 38, t=11806446 jiffies, g=137737817, c=137737816, q=71885)
[14174450.781706] INFO: rcu_sched detected stalls on CPUs/tasks: { 0} (detected by 11, t=11822204 jiffies, g=137737817, c=137737816, q=71981)
[14174513.874595] INFO: rcu_sched detected stalls on CPUs/tasks: { 0} (detected by 16, t=11837959 jiffies, g=137737817, c=137737816, q=72069)
[14174576.967483] INFO: rcu_sched detected stalls on CPUs/tasks: { 0} (detected by 11, t=11853714 jiffies, g=137737817, c=137737816, q=72149)

Event Timeline

jcrespo created this task.Jul 18 2016, 10:08 AM
Restricted Application added subscribers: Zppix, Aklapper. · View Herald TranscriptJul 18 2016, 10:08 AM

Mentioned in SAL [2016-07-18T10:10:41Z] <jynus> hard reset for db2056 T140598

Slot 0  HP Smart Array P420i Controller        (1 GB, v6.00)  1 Logical Drive
1719-Slot 0 Drive Array - A controller failure event occurred prior to this
     power-up.  (Previous lock up code = 0x13)
1792-Slot 0 Drive Array - Valid Data Found in Write-Back Cache.
     Data will automatically be written to drive array.

No mysql-related logs before the freeze:

160524 12:34:13 [Note] Slave I/O thread: connected to master 'repl@db2017.codfw.wmnet:3306',replication starts at GTID position '0-171970567-2955826593'
160718 10:20:01 mysqld_safe Starting mysqld daemon with databases from /srv/sqldata

No syslog/kernel messages.

But I found this on the management logs:

</system1/log1>hpiLO-> show record6

status=0
status_tag=COMMAND COMPLETED
Mon Jul 18 10:34:35 2016



/system1/log1/record6
  Targets
  Properties
    number=6
    severity=Critical
    date=07/17/2016
    time=20:40
    description=Drive Array Controller Failure (Slot 0)
  Verbs
    cd version exit show set
jcrespo renamed this task from db2056 freezed to db2056 RAID controller (temporary) failure.Jul 18 2016, 10:38 AM
jcrespo edited projects, added ops-eqiad; removed DBA.Jul 18 2016, 10:50 AM
Restricted Application added a subscriber: Southparkfan. · View Herald TranscriptJul 18 2016, 10:50 AM
jcrespo closed this task as Resolved.Jul 18 2016, 2:17 PM
jcrespo claimed this task.

Back to normal, we will keep it monitored in case it happens again.