db2044 HW RAID failure
Closed, Resolved · Public

Description

db2044 appears to be broken, and from the iLO console I can see:

[7913506.346241] sd 0:1:0:0: rejecting I/O to offline device
db2044 login:
[7913506.510064] sd 0:1:0:0: rejecting I/O to offline device
db2044 login: root
[7913511.072062] sd 0:1:0:0: rejecting I/O to offline device
[7913511.098193] sd 0:1:0:0: rejecting I/O to offline device
[7913521.235712] sd 0:1:0:0: rejecting I/O to offline device
[7913521.261847] sd 0:1:0:0: rejecting I/O to offline device
[7913521.484802] sd 0:1:0:0: rejecting I/O to offline device
[7913521.734819] sd 0:1:0:0: rejecting I/O to offline device
[7913521.985010] sd 0:1:0:0: rejecting I/O to offline device
[7913522.234647] sd 0:1:0:0: rejecting I/O to offline device
[7913522.262094] sd 0:1:0:0: rejecting I/O to offline device
[7913536.276856] ACPI Error: SMBus/IPMI/GenericSerialBus write requires Buffer of length 66, found length 32 (20160831/exfield-427)
[7913536.338935] ACPI Error: Method parse/execution failed [\_SB.PMI0._PMM] (Node ffff9957ae852ac8), AE_AML_BUFFER_LIMIT (20160831/psparse-543)
[7913536.401888] ACPI Exception: AE_AML_BUFFER_LIMIT, Evaluating _PMM (20160831/power_meter-338)
[7913547.689155] ACPI Error: SMBus/IPMI/GenericSerialBus write requires Buffer of length 66, found length 32 (20160831/exfield-427)
[7913547.750294] ACPI Error: Method parse/execution failed [\_SB.PMI0._PMM] (Node ffff9957ae852ac8), AE_AML_BUFFER_LIMIT (20160831/psparse-543)
[7913547.812872] ACPI Exception: AE_AML_BUFFER_LIMIT, Evaluating _PMM (20160831/power_meter-338)
[7913550.319915] sd 0:1:0:0: rejecting I/O to offline device
Restricted Application added a subscriber: Aklapper. · Fri, Sep 1, 6:59 AM

From the logs:

/system1/log1/record16
  Targets
  Properties
    number=16
    severity=Critical
    date=09/01/2017
    time=06:18
    description=Drive Array Controller Failure (Slot 0)
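For anyone triaging similar records, here is a minimal Python sketch that reads a record like the one above into a dictionary and checks its severity. The function name `parse_ilo_record` is hypothetical, and the sketch assumes the record is available as plain text in exactly the `key=value` layout shown here. It is not an official iLO API.

```python
def parse_ilo_record(text):
    """Parse the 'key=value' lines of one log record into a dict.

    Hypothetical helper: assumes the plain-text layout pasted in this task.
    """
    record = {}
    for line in text.splitlines():
        key, sep, value = line.strip().partition("=")
        if sep:  # skip lines without '=' such as the path, 'Targets', 'Properties'
            record[key] = value
    return record


sample = """\
/system1/log1/record16
  Targets
  Properties
    number=16
    severity=Critical
    date=09/01/2017
    time=06:18
    description=Drive Array Controller Failure (Slot 0)
"""

record = parse_ilo_record(sample)
is_critical = record.get("severity") == "Critical"
print(record["description"], is_critical)
```

Splitting on the first `=` only keeps the full description intact even if it contains further punctuation.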

Mentioned in SAL (#wikimedia-operations) [2017-09-01T07:06:19Z] <marostegui> Power reset db2044 as it is unresponsive - T174764

I rebooted the server because it was essentially unresponsive, and it has apparently come back fine. I will run some more checks before starting MySQL.
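One possible check, sketched in Python: scan captured kernel log text (for example the output of `dmesg`) for the "rejecting I/O to offline device" lines seen in the description and count them per SCSI address. The function name `count_offline_rejections` is hypothetical, and the regular expression is inferred only from the lines pasted in this task, so treat it as an illustration and not a complete health check.

```python
import re
from collections import Counter

# Matches kernel lines such as:
# [7913506.346241] sd 0:1:0:0: rejecting I/O to offline device
OFFLINE_RE = re.compile(
    r"^\[\s*\d+\.\d+\]\s+sd (?P<dev>\d+:\d+:\d+:\d+): "
    r"rejecting I/O to offline device"
)


def count_offline_rejections(log_text):
    """Return a Counter mapping SCSI address -> number of rejected-I/O lines."""
    counts = Counter()
    for line in log_text.splitlines():
        match = OFFLINE_RE.match(line.strip())
        if match:
            counts[match.group("dev")] += 1
    return counts


sample = """\
[7913506.346241] sd 0:1:0:0: rejecting I/O to offline device
db2044 login:
[7913506.510064] sd 0:1:0:0: rejecting I/O to offline device
[7913536.401888] ACPI Exception: AE_AML_BUFFER_LIMIT, Evaluating _PMM (20160831/power_meter-338)
"""

counts = count_offline_rejections(sample)
print(counts)
```

An empty counter after the reboot would be consistent with the device being back online, although it does not rule out other hardware faults.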

jcrespo renamed this task from db2044 HW issues to db2044 HW RAID failure. · Fri, Sep 1, 7:16 AM
jcrespo added a project: ops-codfw.
Restricted Application added a project: Operations. · Fri, Sep 1, 7:16 AM
Marostegui closed this task as Resolved. · Fri, Sep 1, 10:18 AM
Marostegui claimed this task.

After rebooting the server again, everything looks good and I see no more HW errors.
I have started MySQL and replication, and everything looks OK.

I am going to close this ticket for now. If it happens again, I suggest we reopen it and contact HP.

Marostegui triaged this task as Normal priority. · Fri, Sep 1, 10:19 AM