db2044 HW RAID failure
Closed, Resolved (Public)

Description

db2044 appears to be broken and from the ILO I can see:

[7913506.346241] sd 0:1:0:0: rejecting I/O to offline device
db2044 login:
[7913506.510064] sd 0:1:0:0: rejecting I/O to offline device
db2044 login: root
[7913511.072062] sd 0:1:0:0: rejecting I/O to offline device
[7913511.098193] sd 0:1:0:0: rejecting I/O to offline device
[7913521.235712] sd 0:1:0:0: rejecting I/O to offline device
[7913521.261847] sd 0:1:0:0: rejecting I/O to offline device
[7913521.484802] sd 0:1:0:0: rejecting I/O to offline device
[7913521.734819] sd 0:1:0:0: rejecting I/O to offline device
[7913521.985010] sd 0:1:0:0: rejecting I/O to offline device
[7913522.234647] sd 0:1:0:0: rejecting I/O to offline device
[7913522.262094] sd 0:1:0:0: rejecting I/O to offline device
[7913536.276856] ACPI Error: SMBus/IPMI/GenericSerialBus write requires Buffer of length 66, found length 32 (20160831/exfield-427)
[7913536.338935] ACPI Error: Method parse/execution failed [\_SB.PMI0._PMM] (Node ffff9957ae852ac8), AE_AML_BUFFER_LIMIT (20160831/psparse-543)
[7913536.401888] ACPI Exception: AE_AML_BUFFER_LIMIT, Evaluating _PMM (20160831/power_meter-338)
[7913547.689155] ACPI Error: SMBus/IPMI/GenericSerialBus write requires Buffer of length 66, found length 32 (20160831/exfield-427)
[7913547.750294] ACPI Error: Method parse/execution failed [\_SB.PMI0._PMM] (Node ffff9957ae852ac8), AE_AML_BUFFER_LIMIT (20160831/psparse-543)
[7913547.812872] ACPI Exception: AE_AML_BUFFER_LIMIT, Evaluating _PMM (20160831/power_meter-338)
[7913550.319915] sd 0:1:0:0: rejecting I/O to offline device
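The bracketed numbers in the console output above are seconds since boot, so they also show how long the host had been up when the volume went offline. A minimal Python sketch for summarizing such output, assuming the usual `[seconds.micros] message` dmesg layout; the function name `summarize_console` is made up for illustration:

```python
import re
from collections import Counter

# Matches dmesg-style lines such as "[7913506.346241] sd 0:1:0:0: ...".
LINE_RE = re.compile(r"^\[\s*(\d+\.\d+)\]\s+(.*)$")

def summarize_console(lines):
    """Return (uptime in days at the latest line, Counter of message texts)."""
    counts = Counter()
    last_ts = 0.0
    for line in lines:
        m = LINE_RE.match(line.strip())
        if not m:
            continue  # skip login prompts and other non-kernel lines
        last_ts = max(last_ts, float(m.group(1)))
        counts[m.group(2)] += 1
    return last_ts / 86400.0, counts

sample = [
    "[7913506.346241] sd 0:1:0:0: rejecting I/O to offline device",
    "db2044 login: root",
    "[7913550.319915] sd 0:1:0:0: rejecting I/O to offline device",
]
days, counts = summarize_console(sample)
print(f"uptime ~{days:.1f} days")
for msg, n in counts.most_common():
    print(n, msg)
```

For the excerpt above this gives an uptime of roughly 91 days at the time of the failure.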

Event Timeline

From the logs:

/system1/log1/record16
  Targets
  Properties
    number=16
    severity=Critical
    date=09/01/2017
    time=06:18
    description=Drive Array Controller Failure (Slot 0)
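Records in this layout are easy to filter by severity. A minimal Python sketch, assuming the `key=value` format shown above; `parse_ilo_record` is a hypothetical helper, not part of any HP tooling:

```python
def parse_ilo_record(text):
    """Parse one ILO log record dump into a dict of its key=value properties."""
    record = {}
    for raw in text.splitlines():
        line = raw.strip()
        if line.startswith("/system1/log1/"):
            record["path"] = line  # the record's own path line
        elif "=" in line:
            key, _, value = line.partition("=")
            record[key.strip()] = value.strip()
    return record

sample = """\
/system1/log1/record16
  Targets
  Properties
    number=16
    severity=Critical
    date=09/01/2017
    time=06:18
    description=Drive Array Controller Failure (Slot 0)
"""
rec = parse_ilo_record(sample)
if rec.get("severity") == "Critical":
    print(rec["date"], rec["time"], rec["description"])
```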

Mentioned in SAL (#wikimedia-operations) [2017-09-01T07:06:19Z] <marostegui> Power reset db2044 as it is unresponsive - T174764

I have rebooted the server because it was basically unresponsive, and it appears to have come back fine. I will do some more checks before starting MySQL.

jcrespo renamed this task from db2044 HW issues to db2044 HW RAID failure. Sep 1 2017, 7:16 AM
jcrespo added a project: ops-codfw.
Marostegui claimed this task.

After rebooting the server again, everything looks good again and I see no more HW errors.
I have started mysql and replication and everything is looking ok.

I am going to close this ticket for now. If it happens again, I suggest we reopen it and contact HP.

Marostegui triaged this task as Medium priority. Sep 1 2017, 10:19 AM

Looks like this server has crashed again for the same reason:

[Tue Sep 26 20:17:42 2017] hpsa 0000:02:00.0: scsi 0:1:0:0: resetting logical  Direct-Access     HP       LOGICAL VOLUME   RAID-1(+0) SSDSmartPathCap- En- Exp=1
[Tue Sep 26 20:18:03 2017] hpsa 0000:02:00.0: Controller lockup detected: 0xffff0000 after 30
[Tue Sep 26 20:18:03 2017] hpsa 0000:02:00.0: controller lockup detected: LUN:0000004000000000 CDB:01040000000000000000000000000000
[Tue Sep 26 20:18:03 2017] hpsa 0000:02:00.0: Controller lockup detected during reset wait
[Tue Sep 26 20:18:03 2017] hpsa 0000:02:00.0: scsi 0:1:0:0: reset logical  failed Direct-Access     HP       LOGICAL VOLUME   RAID-1(+0) SSDSmartPathCap- En- Exp=1
[Tue Sep 26 20:18:03 2017] sd 0:1:0:0: Device offlined - not ready after error recovery
[Tue Sep 26 20:18:03 2017] sd 0:1:0:0: Device offlined - not ready after error recovery
[Tue Sep 26 20:18:03 2017] sd 0:1:0:0: Device offlined - not ready after error recovery
[Tue Sep 26 20:18:03 2017] sd 0:1:0:0: Device offlined - not ready after error recovery
[Tue Sep 26 20:18:03 2017] sd 0:1:0:0: Device offlined - not ready after error recovery
[Tue Sep 26 20:18:03 2017] sd 0:1:0:0: Device offlined - not ready after error recovery
[Tue Sep 26 20:18:03 2017] sd 0:1:0:0: Device offlined - not ready after error recovery
[Tue Sep 26 20:18:03 2017] sd 0:1:0:0: Device offlined - not ready after error recovery
[Tue Sep 26 20:18:03 2017] sd 0:1:0:0: Device offlined - not ready after error recovery
[Tue Sep 26 20:18:03 2017] sd 0:1:0:0: Device offlined - not ready after error recovery
[Tue Sep 26 20:18:03 2017] sd 0:1:0:0: Device offlined - not ready after error recovery
[Tue Sep 26 20:18:03 2017] sd 0:1:0:0: Device offlined - not ready after error recovery
[Tue Sep 26 20:18:03 2017] sd 0:1:0:0: Device offlined - not ready after error recovery
[Tue Sep 26 20:18:03 2017] sd 0:1:0:0: Device offlined - not ready after error recovery
[Tue Sep 26 20:18:03 2017] sd 0:1:0:0: Device offlined - not ready after error recovery
[Tue Sep 26 20:18:03 2017] sd 0:1:0:0: Device offlined - not ready after error recovery
[Tue Sep 26 20:18:03 2017] sd 0:1:0:0: Device offlined - not ready after error recovery
[Tue Sep 26 20:18:03 2017] sd 0:1:0:0: Device offlined - not ready after error recovery
[Tue Sep 26 20:18:03 2017] sd 0:1:0:0: Device offlined - not ready after error recovery
[Tue Sep 26 20:18:03 2017] sd 0:1:0:0: Device offlined - not ready after error recovery
[Tue Sep 26 20:18:03 2017] sd 0:1:0:0: Device offlined - not ready after error recovery
[Tue Sep 26 20:18:03 2017] sd 0:1:0:0: Device offlined - not ready after error recovery
[Tue Sep 26 20:18:03 2017] sd 0:1:0:0: Device offlined - not ready after error recovery
[Tue Sep 26 20:18:03 2017] sd 0:1:0:0: Device offlined - not ready after error recovery
[Tue Sep 26 20:18:03 2017] sd 0:1:0:0: Device offlined - not ready after error recovery
[Tue Sep 26 20:18:03 2017] sd 0:1:0:0: Device offlined - not ready after error recovery
[Tue Sep 26 20:18:03 2017] sd 0:1:0:0: Device offlined - not ready after error recovery
[Tue Sep 26 20:18:03 2017] sd 0:1:0:0: Device offlined - not ready after error recovery
[Tue Sep 26 20:18:03 2017] sd 0:1:0:0: Device offlined - not ready after error recovery
[Tue Sep 26 20:18:03 2017] sd 0:1:0:0: Device offlined - not ready after error recovery
[Tue Sep 26 20:18:03 2017] sd 0:1:0:0: Device offlined - not ready after error recovery
[Tue Sep 26 20:18:03 2017] sd 0:1:0:0: Device offlined - not ready after error recovery
[Tue Sep 26 20:18:03 2017] sd 0:1:0:0: Device offlined - not ready after error recovery
[Tue Sep 26 20:18:03 2017] sd 0:1:0:0: Device offlined - not ready after error recovery
[Tue Sep 26 20:18:03 2017] sd 0:1:0:0: Device offlined - not ready after error recovery
[Tue Sep 26 20:18:03 2017] sd 0:1:0:0: Device offlined - not ready after error recovery
[Tue Sep 26 20:18:03 2017] sd 0:1:0:0: Device offlined - not ready after error recovery
[Tue Sep 26 20:18:03 2017] sd 0:1:0:0: Device offlined - not ready after error recovery
[Tue Sep 26 20:18:03 2017] sd 0:1:0:0: Device offlined - not ready after error recovery
[Tue Sep 26 20:18:03 2017] sd 0:1:0:0: Device offlined - not ready after error recovery
[Tue Sep 26 20:18:03 2017] sd 0:1:0:0: Device offlined - not ready after error recovery
[Tue Sep 26 20:18:03 2017] sd 0:1:0:0: Device offlined - not ready after error recovery
[Tue Sep 26 20:18:03 2017] sd 0:1:0:0: Device offlined - not ready after error recovery
[Tue Sep 26 20:18:03 2017] sd 0:1:0:0: Device offlined - not ready after error recovery
[Tue Sep 26 20:18:03 2017] sd 0:1:0:0: Device offlined - not ready after error recovery
[Tue Sep 26 20:18:03 2017] sd 0:1:0:0: Device offlined - not ready after error recovery
[Tue Sep 26 20:18:03 2017] sd 0:1:0:0: Device offlined - not ready after error recovery
[Tue Sep 26 20:18:03 2017] sd 0:1:0:0: Device offlined - not ready after error recovery
[Tue Sep 26 20:18:03 2017] sd 0:1:0:0: Device offlined - not ready after error recovery
[Tue Sep 26 20:18:03 2017] sd 0:1:0:0: Device offlined - not ready after error recovery
[Tue Sep 26 20:18:03 2017] sd 0:1:0:0: Device offlined - not ready after error recovery
[Tue Sep 26 20:18:03 2017] sd 0:1:0:0: Device offlined - not ready after error recovery
[Tue Sep 26 20:18:03 2017] sd 0:1:0:0: Device offlined - not ready after error recovery
[Tue Sep 26 20:18:03 2017] sd 0:1:0:0: Device offlined - not ready after error recovery
[Tue Sep 26 20:18:03 2017] sd 0:1:0:0: Device offlined - not ready after error recovery
[Tue Sep 26 20:18:03 2017] sd 0:1:0:0: Device offlined - not ready after error recovery
[Tue Sep 26 20:18:03 2017] sd 0:1:0:0: Device offlined - not ready after error recovery
[Tue Sep 26 20:18:03 2017] sd 0:1:0:0: Device offlined - not ready after error recovery
[Tue Sep 26 20:18:03 2017] sd 0:1:0:0: Device offlined - not ready after error recovery
[Tue Sep 26 20:18:03 2017] sd 0:1:0:0: Device offlined - not ready after error recovery
[Tue Sep 26 20:18:03 2017] sd 0:1:0:0: Device offlined - not ready after error recovery
[Tue Sep 26 20:18:03 2017] sd 0:1:0:0: Device offlined - not ready after error recovery
[Tue Sep 26 20:18:03 2017] sd 0:1:0:0: Device offlined - not ready after error recovery
[Tue Sep 26 20:18:03 2017] sd 0:1:0:0: Device offlined - not ready after error recovery
[Tue Sep 26 20:18:03 2017] sd 0:1:0:0: Device offlined - not ready after error recovery
[Tue Sep 26 20:18:03 2017] sd 0:1:0:0: Device offlined - not ready after error recovery
[Tue Sep 26 20:18:03 2017] sd 0:1:0:0: Device offlined - not ready after error recovery
[Tue Sep 26 20:18:03 2017] sd 0:1:0:0: Device offlined - not ready after error recovery
[Tue Sep 26 20:18:03 2017] sd 0:1:0:0: Device offlined - not ready after error recovery
[Tue Sep 26 20:18:03 2017] sd 0:1:0:0: Device offlined - not ready after error recovery
[Tue Sep 26 20:18:03 2017] sd 0:1:0:0: Device offlined - not ready after error recovery
[Tue Sep 26 20:18:03 2017] sd 0:1:0:0: Device offlined - not ready after error recovery
[Tue Sep 26 20:18:03 2017] sd 0:1:0:0: Device offlined - not ready after error recovery
[Tue Sep 26 20:18:03 2017] sd 0:1:0:0: Device offlined - not ready after error recovery
[Tue Sep 26 20:18:03 2017] sd 0:1:0:0: Device offlined - not ready after error recovery
[Tue Sep 26 20:18:03 2017] sd 0:1:0:0: Device offlined - not ready after error recovery
[Tue Sep 26 20:18:03 2017] sd 0:1:0:0: Device offlined - not ready after error recovery
[Tue Sep 26 20:18:03 2017] sd 0:1:0:0: Device offlined - not ready after error recovery
[Tue Sep 26 20:18:03 2017] sd 0:1:0:0: Device offlined - not ready after error recovery
[Tue Sep 26 20:18:03 2017] sd 0:1:0:0: Device offlined - not ready after error recovery
[Tue Sep 26 20:18:03 2017] sd 0:1:0:0: Device offlined - not ready after error recovery
[Tue Sep 26 20:18:03 2017] sd 0:1:0:0: Device offlined - not ready after error recovery
[Tue Sep 26 20:18:03 2017] sd 0:1:0:0: Device offlined - not ready after error recovery
[Tue Sep 26 20:18:03 2017] sd 0:1:0:0: Device offlined - not ready after error recovery
[Tue Sep 26 20:18:03 2017] sd 0:1:0:0: Device offlined - not ready after error recovery
[Tue Sep 26 20:18:03 2017] sd 0:1:0:0: Device offlined - not ready after error recovery
[Tue Sep 26 20:18:03 2017] sd 0:1:0:0: Device offlined - not ready after error recovery
[Tue Sep 26 20:18:03 2017] sd 0:1:0:0: Device offlined - not ready after error recovery
[Tue Sep 26 20:18:03 2017] sd 0:1:0:0: Device offlined - not ready after error recovery
[Tue Sep 26 20:18:03 2017] sd 0:1:0:0: Device offlined - not ready after error recovery
[Tue Sep 26 20:18:03 2017] sd 0:1:0:0: Device offlined - not ready after error recovery
[Tue Sep 26 20:18:03 2017] sd 0:1:0:0: Device offlined - not ready after error recovery
[Tue Sep 26 20:18:03 2017] sd 0:1:0:0: Device offlined - not ready after error recovery
[Tue Sep 26 20:18:03 2017] sd 0:1:0:0: Device offlined - not ready after error recovery
[Tue Sep 26 20:18:03 2017] sd 0:1:0:0: Device offlined - not ready after error recovery
[Tue Sep 26 20:18:03 2017] sd 0:1:0:0: Device offlined - not ready after error recovery
[Tue Sep 26 20:18:03 2017] sd 0:1:0:0: Device offlined - not ready after error recovery
[Tue Sep 26 20:18:03 2017] sd 0:1:0:0: Device offlined - not ready after error recovery
[Tue Sep 26 20:18:03 2017] sd 0:1:0:0: Device offlined - not ready after error recovery
[Tue Sep 26 20:18:03 2017] sd 0:1:0:0: Device offlined - not ready after error recovery
[Tue Sep 26 20:18:03 2017] sd 0:1:0:0: Device offlined - not ready after error recovery
[Tue Sep 26 20:18:03 2017] sd 0:1:0:0: Device offlined - not ready after error recovery
[Tue Sep 26 20:18:03 2017] sd 0:1:0:0: Device offlined - not ready after error recovery
[Tue Sep 26 20:18:03 2017] sd 0:1:0:0: Device offlined - not ready after error recovery
[Tue Sep 26 20:18:03 2017] sd 0:1:0:0: Device offlined - not ready after error recovery
[Tue Sep 26 20:18:03 2017] sd 0:1:0:0: Device offlined - not ready after error recovery
[Tue Sep 26 20:18:03 2017] sd 0:1:0:0: Device offlined - not ready after error recovery
[Tue Sep 26 20:18:03 2017] sd 0:1:0:0: Device offlined - not ready after error recovery
[Tue Sep 26 20:18:03 2017] sd 0:1:0:0: Device offlined - not ready after error recovery
[Tue Sep 26 20:18:03 2017] sd 0:1:0:0: Device offlined - not ready after error recovery
[Tue Sep 26 20:18:03 2017] sd 0:1:0:0: Device offlined - not ready after error recovery
[Tue Sep 26 20:18:03 2017] sd 0:1:0:0: Device offlined - not ready after error recovery
[Tue Sep 26 20:18:03 2017] sd 0:1:0:0: Device offlined - not ready after error recovery
[Tue Sep 26 20:18:03 2017] sd 0:1:0:0: Device offlined - not ready after error recovery
[Tue Sep 26 20:18:03 2017] sd 0:1:0:0: Device offlined - not ready after error recovery
[Tue Sep 26 20:18:03 2017] sd 0:1:0:0: Device offlined - not ready after error recovery
[Tue Sep 26 20:18:03 2017] sd 0:1:0:0: Device offlined - not ready after error recovery
[Tue Sep 26 20:18:03 2017] sd 0:1:0:0: Device offlined - not ready after error recovery
[Tue Sep 26 20:18:03 2017] sd 0:1:0:0: Device offlined - not ready after error recovery
[Tue Sep 26 20:18:03 2017] sd 0:1:0:0: Device offlined - not ready after error recovery
[Tue Sep 26 20:18:03 2017] sd 0:1:0:0: Device offlined - not ready after error recovery
[Tue Sep 26 20:18:03 2017] sd 0:1:0:0: Device offlined - not ready after error recovery
[Tue Sep 26 20:18:03 2017] sd 0:1:0:0: Device offlined - not ready after error recovery
[Tue Sep 26 20:18:03 2017] sd 0:1:0:0: Device offlined - not ready after error recovery
[Tue Sep 26 20:18:03 2017] sd 0:1:0:0: Device offlined - not ready after error recovery
[Tue Sep 26 20:18:03 2017] sd 0:1:0:0: Device offlined - not ready after error recovery
[Tue Sep 26 20:18:03 2017] sd 0:1:0:0: Device offlined - not ready after error recovery
[Tue Sep 26 20:18:03 2017] sd 0:1:0:0: Device offlined - not ready after error recovery
[Tue Sep 26 20:18:03 2017] sd 0:1:0:0: Device offlined - not ready after error recovery
[Tue Sep 26 20:18:03 2017] sd 0:1:0:0: Device offlined - not ready after error recovery
[Tue Sep 26 20:18:03 2017] sd 0:1:0:0: Device offlined - not ready after error recovery
[Tue Sep 26 20:18:03 2017] sd 0:1:0:0: Device offlined - not ready after error recovery
[Tue Sep 26 20:18:03 2017] sd 0:1:0:0: Device offlined - not ready after error recovery
[Tue Sep 26 20:18:03 2017] sd 0:1:0:0: Device offlined - not ready after error recovery
[Tue Sep 26 20:18:03 2017] sd 0:1:0:0: Device offlined - not ready after error recovery
[Tue Sep 26 20:18:03 2017] sd 0:1:0:0: Device offlined - not ready after error recovery
[Tue Sep 26 20:18:03 2017] sd 0:1:0:0: Device offlined - not ready after error recovery
[Tue Sep 26 20:18:03 2017] sd 0:1:0:0: Device offlined - not ready after error recovery
[Tue Sep 26 20:18:03 2017] sd 0:1:0:0: Device offlined - not ready after error recovery
[Tue Sep 26 20:18:03 2017] sd 0:1:0:0: Device offlined - not ready after error recovery
[Tue Sep 26 20:18:03 2017] sd 0:1:0:0: Device offlined - not ready after error recovery
[Tue Sep 26 20:18:03 2017] sd 0:1:0:0: Device offlined - not ready after error recovery
[Tue Sep 26 20:18:03 2017] sd 0:1:0:0: Device offlined - not ready after error recovery
[Tue Sep 26 20:18:03 2017] sd 0:1:0:0: Device offlined - not ready after error recovery
[Tue Sep 26 20:18:03 2017] sd 0:1:0:0: Device offlined - not ready after error recovery
[Tue Sep 26 20:18:03 2017] sd 0:1:0:0: Device offlined - not ready after error recovery
[Tue Sep 26 20:18:03 2017] sd 0:1:0:0: Device offlined - not ready after error recovery
[Tue Sep 26 20:18:03 2017] sd 0:1:0:0: Device offlined - not ready after error recovery
[Tue Sep 26 20:18:03 2017] sd 0:1:0:0: Device offlined - not ready after error recovery
[Tue Sep 26 20:18:03 2017] sd 0:1:0:0: Device offlined - not ready after error recovery
[Tue Sep 26 20:18:03 2017] sd 0:1:0:0: Device offlined - not ready after error recovery
[Tue Sep 26 20:18:03 2017] sd 0:1:0:0: Device offlined - not ready after error recovery
[Tue Sep 26 20:18:03 2017] sd 0:1:0:0: Device offlined - not ready after error recovery
[Tue Sep 26 20:18:03 2017] sd 0:1:0:0: Device offlined - not ready after error recovery
[Tue Sep 26 20:18:03 2017] sd 0:1:0:0: Device offlined - not ready after error recovery
[Tue Sep 26 20:18:03 2017] sd 0:1:0:0: [sda] tag#154 FAILED Result: hostbyte=DID_TIME_OUT driverbyte=DRIVER_OK
[Tue Sep 26 20:18:03 2017] sd 0:1:0:0: [sda] tag#154 CDB: Write(16) 8a 00 00 00 00 00 d4 83 a6 00 00 00 02 00 00 00
[Tue Sep 26 20:18:03 2017] blk_update_request: I/O error, dev sda, sector 3565397504
[Tue Sep 26 20:18:03 2017] hpsa 0000:02:00.0: failed 157 commands in fail_all
[Tue Sep 26 20:18:03 2017] sd 0:1:0:0: rejecting I/O to offline device
[Tue Sep 26 20:18:03 2017] sd 0:1:0:0: [sda] killing request
[Tue Sep 26 20:18:03 2017] sd 0:1:0:0: [sda] tag#153 FAILED Result: hostbyte=DID_NO_CONNECT driverbyte=DRIVER_OK
[Tue Sep 26 20:18:03 2017] sd 0:1:0:0: [sda] tag#153 CDB: Write(16) 8a 00 00 00 00 00 02 50 ed 48 00 00 03 c0 00 00
[Tue Sep 26 20:18:03 2017] blk_update_request: I/O error, dev sda, sector 38858056
[Tue Sep 26 20:18:03 2017] Buffer I/O error on dev sda1, logical block 4856873, lost sync page write
[Tue Sep 26 20:18:03 2017] sd 0:1:0:0: [sda] tag#152 FAILED Result: hostbyte=DID_NO_CONNECT driverbyte=DRIVER_OK
[Tue Sep 26 20:18:03 2017] sd 0:1:0:0: [sda] tag#152 CDB: Write(16) 8a 00 00 00 00 00 02 f5 5d 00 00 00 00 60 00 00
[Tue Sep 26 20:18:03 2017] blk_update_request: I/O error, dev sda, sector 49634560
[Tue Sep 26 20:18:03 2017] EXT4-fs warning (device sda1): ext4_end_bio:314: I/O error -5 writing to inode 1873721 (offset 0 size 0 starting block 6204321)
[Tue Sep 26 20:18:03 2017] Buffer I/O error on device sda1, logical block 6203936
[Tue Sep 26 20:18:03 2017] EXT4-fs warning (device sda1): ext4_end_bio:314: I/O error -5 writing to inode 1873721 (offset 0 size 0 starting block 6204322)
[Tue Sep 26 20:18:03 2017] Buffer I/O error on device sda1, logical block 6203937
[Tue Sep 26 20:18:03 2017] EXT4-fs warning (device sda1): ext4_end_bio:314: I/O error -5 writing to inode 1873721 (offset 0 size 0 starting block 6204323)
[Tue Sep 26 20:18:03 2017] Buffer I/O error on device sda1, logical block 6203938
[Tue Sep 26 20:18:03 2017] EXT4-fs warning (device sda1): ext4_end_bio:314: I/O error -5 writing to inode 1873721 (offset 0 size 0 starting block 6204324)
[Tue Sep 26 20:18:03 2017] Buffer I/O error on device sda1, logical block 6203939
[Tue Sep 26 20:18:03 2017] EXT4-fs warning (device sda1): ext4_end_bio:314: I/O error -5 writing to inode 1873721 (offset 0 size 0 starting block 6204325)
[Tue Sep 26 20:18:03 2017] Buffer I/O error on device sda1, logical block 6203940
[Tue Sep 26 20:18:03 2017] EXT4-fs warning (device sda1): ext4_end_bio:314: I/O error -5 writing to inode 1873721 (offset 0 size 0 starting block 6204326)
[Tue Sep 26 20:18:03 2017] Buffer I/O error on device sda1, logical block 6203941
[Tue Sep 26 20:18:03 2017] EXT4-fs warning (device sda1): ext4_end_bio:314: I/O error -5 writing to inode 1873721 (offset 0 size 0 starting block 6204327)
[Tue Sep 26 20:18:03 2017] Buffer I/O error on device sda1, logical block 6203942
[Tue Sep 26 20:18:03 2017] EXT4-fs warning (device sda1): ext4_end_bio:314: I/O error -5 writing to inode 1873721 (offset 0 size 0 starting block 6204328)
[Tue Sep 26 20:18:03 2017] Buffer I/O error on device sda1, logical block 6203943
[Tue Sep 26 20:18:03 2017] EXT4-fs warning (device sda1): ext4_end_bio:314: I/O error -5 writing to inode 1873721 (offset 0 size 0 starting block 6204329)
[Tue Sep 26 20:18:03 2017] Buffer I/O error on device sda1, logical block 6203944
[Tue Sep 26 20:18:03 2017] EXT4-fs warning (device sda1): ext4_end_bio:314: I/O error -5 writing to inode 1873721 (offset 0 size 0 starting block 6204330)
[Tue Sep 26 20:18:03 2017] Buffer I/O error on device sda1, logical block 6203945
[Tue Sep 26 20:18:03 2017] sd 0:1:0:0: [sda] tag#151 FAILED Result: hostbyte=DID_NO_CONNECT driverbyte=DRIVER_OK
[Tue Sep 26 20:18:03 2017] sd 0:1:0:0: [sda] tag#151 CDB: Write(16) 8a 00 00 00 00 00 02 f4 23 00 00 00 01 00 00 00
[Tue Sep 26 20:18:03 2017] blk_update_request: I/O error, dev sda, sector 49554176
[Tue Sep 26 20:18:03 2017] sd 0:1:0:0: [sda] tag#150 FAILED Result: hostbyte=DID_NO_CONNECT driverbyte=DRIVER_OK
[Tue Sep 26 20:18:03 2017] sd 0:1:0:0: [sda] tag#150 CDB: Write(16) 8a 00 00 00 00 00 02 f4 20 80 00 00 00 80 00 00
[Tue Sep 26 20:18:03 2017] blk_update_request: I/O error, dev sda, sector 49553536
[Tue Sep 26 20:18:03 2017] sd 0:1:0:0: [sda] tag#149 FAILED Result: hostbyte=DID_NO_CONNECT driverbyte=DRIVER_OK
[Tue Sep 26 20:18:03 2017] sd 0:1:0:0: [sda] tag#149 CDB: Write(16) 8a 00 00 00 00 00 01 bd 2d d0 00 00 00 80 00 00
[Tue Sep 26 20:18:03 2017] blk_update_request: I/O error, dev sda, sector 29175248
[Tue Sep 26 20:18:03 2017] sd 0:1:0:0: [sda] tag#148 FAILED Result: hostbyte=DID_NO_CONNECT driverbyte=DRIVER_OK
[Tue Sep 26 20:18:03 2017] sd 0:1:0:0: [sda] tag#148 CDB: Write(16) 8a 00 00 00 00 00 02 f6 2c 00 00 00 0a b0 00 00
[Tue Sep 26 20:18:03 2017] blk_update_request: I/O error, dev sda, sector 49687552
[Tue Sep 26 20:18:04 2017] sd 0:1:0:0: [sda] tag#147 FAILED Result: hostbyte=DID_NO_CONNECT driverbyte=DRIVER_OK
[Tue Sep 26 20:18:04 2017] sd 0:1:0:0: [sda] tag#147 CDB: Write(16) 8a 00 00 00 00 00 02 f6 24 00 00 00 08 00 00 00
[Tue Sep 26 20:18:04 2017] blk_update_request: I/O error, dev sda, sector 49685504
[Tue Sep 26 20:18:04 2017] sd 0:1:0:0: [sda] tag#146 FAILED Result: hostbyte=DID_NO_CONNECT driverbyte=DRIVER_OK
[Tue Sep 26 20:18:04 2017] sd 0:1:0:0: [sda] tag#146 CDB: Write(16) 8a 00 00 00 00 00 02 f6 1c 00 00 00 08 00 00 00
[Tue Sep 26 20:18:04 2017] blk_update_request: I/O error, dev sda, sector 49683456
[Tue Sep 26 20:18:04 2017] sd 0:1:0:0: [sda] tag#145 FAILED Result: hostbyte=DID_NO_CONNECT driverbyte=DRIVER_OK
[Tue Sep 26 20:18:04 2017] sd 0:1:0:0: [sda] tag#145 CDB: Write(16) 8a 00 00 00 00 00 02 f6 14 00 00 00 08 00 00 00
[Tue Sep 26 20:18:04 2017] blk_update_request: I/O error, dev sda, sector 49681408
[Tue Sep 26 20:18:04 2017] sd 0:1:0:0: rejecting I/O to offline device
[Tue Sep 26 20:18:04 2017] sd 0:1:0:0: rejecting I/O to offline device
[Tue Sep 26 20:18:04 2017] sd 0:1:0:0: rejecting I/O to offline device
[Tue Sep 26 20:18:04 2017] Buffer I/O error on dev dm-0, logical block 162618335, lost async page write
[Tue Sep 26 20:18:04 2017] Buffer I/O error on dev dm-0, logical block 9753422, lost async page write
[Tue Sep 26 20:18:04 2017] sd 0:1:0:0: rejecting I/O to offline device
[Tue Sep 26 20:18:04 2017] sd 0:1:0:0: rejecting I/O to offline device
[Tue Sep 26 20:18:04 2017] sd 0:1:0:0: rejecting I/O to offline device
[Tue Sep 26 20:18:04 2017] sd 0:1:0:0: rejecting I/O to offline device
[Tue Sep 26 20:18:04 2017] XFS (dm-0): metadata I/O error: block 0x200 ("xfs_buf_iodone_callback_error") error 5 numblks 16
[Tue Sep 26 20:18:04 2017] sd 0:1:0:0: rejecting I/O to offline device
[Tue Sep 26 20:18:04 2017] sd 0:1:0:0: rejecting I/O to offline device
[Tue Sep 26 20:18:04 2017] sd 0:1:0:0: rejecting I/O to offline device
[Tue Sep 26 20:18:04 2017] sd 0:1:0:0: rejecting I/O to offline device
[Tue Sep 26 20:18:04 2017] sd 0:1:0:0: rejecting I/O to offline device
[Tue Sep 26 20:18:04 2017] sd 0:1:0:0: rejecting I/O to offline device
[Tue Sep 26 20:18:04 2017] Buffer I/O error on dev sda1, logical block 7503956, lost async page write
[Tue Sep 26 20:18:04 2017] Buffer I/O error on dev sda1, logical block 7503882, lost async page write
[Tue Sep 26 20:18:04 2017] Buffer I/O error on dev dm-0, logical block 162618336, lost async page write
[Tue Sep 26 20:18:04 2017] JBD2: Detected IO errors while flushing file data on sda1-8
[Tue Sep 26 20:18:04 2017] Buffer I/O error on dev dm-0, logical block 162618334, lost async page write
[Tue Sep 26 20:18:04 2017] Buffer I/O error on dev dm-0, logical block 2380677, lost async page write
[Tue Sep 26 20:18:04 2017] Buffer I/O error on dev dm-0, logical block 162618337, lost async page write
[Tue Sep 26 20:18:04 2017] Buffer I/O error on dev dm-0, logical block 9753396, lost async page write
[Tue Sep 26 20:18:04 2017] Aborting journal on device sda1-8.
[Tue Sep 26 20:18:04 2017] XFS (dm-0): metadata I/O error: block 0xceed1e00 ("xlog_iodone") error 5 numblks 512
[Tue Sep 26 20:18:04 2017] XFS (dm-0): xfs_do_force_shutdown(0x2) called from line 1200 of file /home/zumbi/linux-4.9.13/fs/xfs/xfs_log.c.  Return address = 0xffffffffc0aa02a6
[Tue Sep 26 20:18:04 2017] sd 0:1:0:0: rejecting I/O to offline device
[Tue Sep 26 20:18:04 2017] XFS (dm-0): Log I/O Error Detected.  Shutting down filesystem
[Tue Sep 26 20:18:04 2017] XFS (dm-0): Please umount the filesystem and rectify the problem(s)
[Tue Sep 26 20:18:04 2017] JBD2: Error -5 detected when updating journal superblock for sda1-8.
[Tue Sep 26 20:18:04 2017] XFS (dm-0): metadata I/O error: block 0xceed2000 ("xlog_iodone") error 5 numblks 512
[Tue Sep 26 20:18:04 2017] XFS (dm-0): xfs_do_force_shutdown(0x2) called from line 1200 of file /home/zumbi/linux-4.9.13/fs/xfs/xfs_log.c.  Return address = 0xffffffffc0aa02a6
[Tue Sep 26 20:18:04 2017] sd 0:1:0:0: rejecting I/O to offline device
[Tue Sep 26 20:18:04 2017] sd 0:1:0:0: rejecting I/O to offline device
[Tue Sep 26 20:18:04 2017] sd 0:1:0:0: rejecting I/O to offline device
[Tue Sep 26 20:18:04 2017] sd 0:1:0:0: rejecting I/O to offline device
[Tue Sep 26 20:18:04 2017] sd 0:1:0:0: rejecting I/O to offline device
[Tue Sep 26 20:18:04 2017] sd 0:1:0:0: rejecting I/O to offline device
[Tue Sep 26 20:18:04 2017] sd 0:1:0:0: rejecting I/O to offline device
[Tue Sep 26 20:18:04 2017] sd 0:1:0:0: rejecting I/O to offline device
[Tue Sep 26 20:18:04 2017] JBD2: Detected IO errors while flushing file data on sda1-8
[Tue Sep 26 20:18:04 2017] sd 0:1:0:0: rejecting I/O to offline device
[Tue Sep 26 20:18:04 2017] EXT4-fs error (device sda1): ext4_journal_check_start:56: Detected aborted journal
[Tue Sep 26 20:18:04 2017] EXT4-fs (sda1): Remounting filesystem read-only
[Tue Sep 26 20:18:04 2017] EXT4-fs (sda1): previous I/O error to superblock detected
[Tue Sep 26 20:18:04 2017] sd 0:1:0:0: rejecting I/O to offline device
[Tue Sep 26 20:18:16 2017] sd 0:1:0:0: rejecting I/O to offline device
[Tue Sep 26 20:18:16 2017] sd 0:1:0:0: rejecting I/O to offline device
[Tue Sep 26 20:18:16 2017] sd 0:1:0:0: rejecting I/O to offline device
[Tue Sep 26 20:18:16 2017] sd 0:1:0:0: rejecting I/O to offline device
[Tue Sep 26 20:18:16 2017] sd 0:1:0:0: rejecting I/O to offline device
[Tue Sep 26 20:18:16 2017] sd 0:1:0:0: rejecting I/O to offline device
[Tue Sep 26 20:18:16 2017] sd 0:1:0:0: rejecting I/O to offline device
[Tue Sep 26 20:18:16 2017] sd 0:1:0:0: rejecting I/O to offline device
[Tue Sep 26 20:18:16 2017] sd 0:1:0:0: rejecting I/O to offline device
[Tue Sep 26 20:18:16 2017] sd 0:1:0:0: rejecting I/O to offline device
[Tue Sep 26 20:18:16 2017] sd 0:1:0:0: rejecting I/O to offline device
[Tue Sep 26 20:18:17 2017] sd 0:1:0:0: rejecting I/O to offline device
[Tue Sep 26 20:18:17 2017] sd 0:1:0:0: rejecting I/O to offline device
[Tue Sep 26 20:18:17 2017] sd 0:1:0:0: rejecting I/O to offline device
[Tue Sep 26 20:21:07 2017] sd 0:1:0:0: rejecting I/O to offline device
[Tue Sep 26 20:21:07 2017] sd 0:1:0:0: rejecting I/O to offline device
[Tue Sep 26 20:21:07 2017] sd 0:1:0:0: rejecting I/O to offline device
[Tue Sep 26 20:21:07 2017] sd 0:1:0:0: rejecting I/O to offline device
[Tue Sep 26 20:21:07 2017] sd 0:1:0:0: rejecting I/O to offline device
[Tue Sep 26 20:21:07 2017] sd 0:1:0:0: rejecting I/O to offline device
[Tue Sep 26 20:21:07 2017] sd 0:1:0:0: rejecting I/O to offline device
[Tue Sep 26 20:21:07 2017] sd 0:1:0:0: rejecting I/O to offline device
[Tue Sep 26 20:21:07 2017] sd 0:1:0:0: rejecting I/O to offline device
[Tue Sep 26 20:21:07 2017] sd 0:1:0:0: rejecting I/O to offline device
[Tue Sep 26 20:21:07 2017] sd 0:1:0:0: rejecting I/O to offline device
[Tue Sep 26 20:21:07 2017] sd 0:1:0:0: rejecting I/O to offline device
[Tue Sep 26 20:21:08 2017] sd 0:1:0:0: rejecting I/O to offline device
[Tue Sep 26 20:21:08 2017] EXT4-fs error (device sda1): ext4_find_entry:1463: inode #1199406: comm bash: reading directory lblock 0
[Tue Sep 26 20:21:08 2017] EXT4-fs (sda1): previous I/O error to superblock detected
[Tue Sep 26 20:21:08 2017] sd 0:1:0:0: rejecting I/O to offline device
[Tue Sep 26 20:21:08 2017] sd 0:1:0:0: rejecting I/O to offline device
[Tue Sep 26 20:21:08 2017] sd 0:1:0:0: rejecting I/O to offline device
[Tue Sep 26 20:21:22 2017] sd 0:1:0:0: rejecting I/O to offline device
[Tue Sep 26 20:23:30 2017] sd 0:1:0:0: rejecting I/O to offline device
[Tue Sep 26 20:23:30 2017] sd 0:1:0:0: rejecting I/O to offline device
[Tue Sep 26 20:23:30 2017] sd 0:1:0:0: rejecting I/O to offline device
[Tue Sep 26 20:23:30 2017] sd 0:1:0:0: rejecting I/O to offline device
[Tue Sep 26 20:23:30 2017] sd 0:1:0:0: rejecting I/O to offline device
[Tue Sep 26 20:23:30 2017] sd 0:1:0:0: rejecting I/O to offline device
[Tue Sep 26 20:23:30 2017] sd 0:1:0:0: rejecting I/O to offline device
[Tue Sep 26 20:23:30 2017] sd 0:1:0:0: rejecting I/O to offline device
[Tue Sep 26 20:23:34 2017] sd 0:1:0:0: rejecting I/O to offline device
[Tue Sep 26 20:23:34 2017] sd 0:1:0:0: rejecting I/O to offline device
[Tue Sep 26 20:23:34 2017] sd 0:1:0:0: rejecting I/O to offline device
[Tue Sep 26 20:23:34 2017] sd 0:1:0:0: rejecting I/O to offline device
[Tue Sep 26 20:23:34 2017] sd 0:1:0:0: rejecting I/O to offline device
[Tue Sep 26 20:23:34 2017] sd 0:1:0:0: rejecting I/O to offline device
[Tue Sep 26 20:23:34 2017] sd 0:1:0:0: rejecting I/O to offline device
[Tue Sep 26 20:23:34 2017] sd 0:1:0:0: rejecting I/O to offline device
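The `CDB: Write(16)` lines in the log above can be cross-checked against the `blk_update_request` sector numbers that follow them. In a SCSI WRITE(16) command, byte 0 is the opcode (0x8a), bytes 2-9 hold the starting logical block address and bytes 10-13 the transfer length, all big-endian. A minimal Python sketch; `decode_write16` is an illustrative name:

```python
def decode_write16(cdb_hex):
    """Decode a SCSI WRITE(16) CDB given as space-separated hex bytes.

    Returns (starting LBA, transfer length in blocks).
    """
    cdb = bytes.fromhex(cdb_hex.replace(" ", ""))
    if len(cdb) != 16 or cdb[0] != 0x8A:
        raise ValueError("not a 16-byte WRITE(16) CDB")
    lba = int.from_bytes(cdb[2:10], "big")       # bytes 2-9
    length = int.from_bytes(cdb[10:14], "big")   # bytes 10-13
    return lba, length

# CDB of tag#154 from the log above.
lba, blocks = decode_write16("8a 00 00 00 00 00 d4 83 a6 00 00 00 02 00 00 00")
print(lba, blocks)  # 3565397504 512
```

The decoded address matches the `sector 3565397504` reported for that request, which is what one would expect with 512-byte logical blocks.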

Mentioned in SAL (#wikimedia-operations) [2017-09-26T20:30:24Z] <marostegui@tin> Synchronized wmf-config/db-codfw.php: Depool db2044 - T174764 (duration: 00m 50s)

I cannot see anything in the ILO logs; the last entry is from 9th Sept.

Mentioned in SAL (#wikimedia-operations) [2017-09-27T06:29:32Z] <marosteg1i> Reboot db2044 after storage failure - T174764

Marostegui added a subscriber: Papaul.

I have rebooted the server via ILO, as there was not much I could debug in this state:

[06:28:34] root@db2044:~# df -hT
-bash: /bin/df: Input/output error
[06:29:34] root@db2044:~# reboot
-bash: /usr/sbin/reboot: Input/output error

Once rebooted, the server came back fine.
MySQL has been started.

I have cleaned up the ILO logs, as there was nothing there from yesterday and sometimes I have seen HP not writing to logs if they were not cleared up before (happened with db2034)

Given that the server is under warranty according to racktables and has crashed twice within a month, @Papaul could you open a case with HP to get the RAID controller (I assume) replaced?
I am going to leave the server depooled.

Dear Mr Papaul Tshibamba,

Thank you for contacting Hewlett Packard Enterprise for your service request. This email confirms your request for service and the details are below.

Your request is being worked on under reference number 5323289661
Status: Case is generated and in Progress

Product description: HP ProLiant DL380p Gen8 12 LFF Configure-to-order Server
Product number: 665552-B21
Serial number: 2M245205HG
Subject: DL380p Gen8 - Controller failure

Yours sincerely,
Hewlett Packard Enterprise

After reviewing all the logs, HP decided to send a new mainboard because the controller is on the mainboard. The tech will be onsite to perform the replacement on Monday 2nd between 9am and 1pm.

  • Replace controller
  • Update firmware

Link to update the firmware:
http://h20564.www2.hpe.com/hpsc/swd/public/detail?sp4ts.oid=5295169&swItemId=MTX_d1dacc37a6764efcb644a0fc0a&swEnvOid=4231#tab-history

Awesome!
Thanks for handling this.
I will make sure the server is off by that time.
Will also remove /etc/udev/rules.d/70-persistent-net.rules

@Papaul can you make sure that booting from PXE by default is disabled when changing the mainboard, so we avoid undesired re-installs? (It should fail anyway, but it is better to disable it.)

Thanks!

@Marostegui yes, I will make sure PXE is disabled.

Mentioned in SAL (#wikimedia-operations) [2017-10-02T08:40:44Z] <marostegui> Stop MySQL on db2044 to get it ready to replace its mainboard - T174764

Mentioned in SAL (#wikimedia-operations) [2017-10-02T08:46:30Z] <marostegui> Poweroff db2044 for HW maintenance - T174764

@Papaul server is now off.
Feel free to power it on once you are done with it

Thank you!

Main board replacement complete.

Thank you
I have upgraded the server and as everything looked good, I have started MySQL again.
Also cleaned HW logs so we can start fresh.

MySQL is catching up now. Going to resolve this and we'll see how the host behaves. Hopefully we'll not need to reopen this ticket!
Thanks for all the help @Papaul

Change 381972 had a related patch set uploaded (by Marostegui; owner: Marostegui):
[operations/mediawiki-config@master] db-codfw.php: Repool db2044

https://gerrit.wikimedia.org/r/381972

Change 381972 merged by jenkins-bot:
[operations/mediawiki-config@master] db-codfw.php: Repool db2044

https://gerrit.wikimedia.org/r/381972

Mentioned in SAL (#wikimedia-operations) [2017-10-03T13:04:20Z] <marostegui@tin> Synchronized wmf-config/db-codfw.php: Repool db2044 - T174764 (duration: 00m 47s)