
Degraded RAID on ms-be2036
Closed, Resolved · Public

Description

TASK AUTO-GENERATED by Nagios/Icinga RAID event handler

A degraded RAID (md) was detected on host ms-be2036. An automatic snapshot of the current RAID status is attached below.

Please sync with the service owner to find the appropriate time window before actually replacing any failed hardware.

CRITICAL: State: degraded, Active: 3, Working: 3, Failed: 1, Spare: 0

$ sudo /usr/local/lib/nagios/plugins/get-raid-status-md
Personalities : [raid1] [linear] [multipath] [raid0] [raid6] [raid5] [raid4] [raid10] 
md0 : active raid1 sda1[0] sdb1[1](F)
      58559488 blocks super 1.2 [2/1] [U_]
      
md1 : active (auto-read-only) raid1 sda2[0] sdb2[1]
      976320 blocks super 1.2 [2/2] [UU]
      
unused devices: <none>
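
The snapshot above can also be checked mechanically. The following is a minimal sketch, not the actual Nagios/Icinga plugin: it assumes the text has the /proc/mdstat layout shown above, and the function name `degraded_md_arrays` is made up for illustration. It reports every array whose member map contains `_` (a missing member).

```python
import re


def degraded_md_arrays(mdstat_text):
    """Return {array_name: member_map} for md arrays with a missing member."""
    degraded = {}
    current = None
    for line in mdstat_text.splitlines():
        header = re.match(r"^(md\d+)\s*:", line)
        if header:
            current = header.group(1)
            continue
        # The member map looks like [U_]: "U" is an up member, "_" a missing one.
        status = re.search(r"\[([U_]+)\]\s*$", line)
        if current and status and "_" in status.group(1):
            degraded[current] = status.group(1)
    return degraded


# Sample copied from the snapshot in this task.
sample = """\
Personalities : [raid1] [linear] [multipath] [raid0] [raid6] [raid5] [raid4] [raid10]
md0 : active raid1 sda1[0] sdb1[1](F)
      58559488 blocks super 1.2 [2/1] [U_]

md1 : active (auto-read-only) raid1 sda2[0] sdb2[1]
      976320 blocks super 1.2 [2/2] [UU]

unused devices: <none>
"""

print(degraded_md_arrays(sample))  # {'md0': 'U_'}
```

For this snapshot only md0 is reported, which matches the `sdb1[1](F)` marker on that array.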

Event Timeline

Restricted Application added a subscriber: Aklapper. Sun, Oct 11, 10:30 AM

Icinga downtime for 4:00:00 set by filippo@cumin1001 on 1 host(s) and their services with reason: reboot

ms-be2036.codfw.wmnet

Mentioned in SAL (#wikimedia-operations) [2020-10-12T07:54:48Z] <godog> reboot ms-be2036 - T265208

The host rebooted into Linux OK; however, there were error messages at boot. They look related to both the iLO and the hardware RAID firmware.

Slot 3 Port 1 : Smart Array P840 Controller - (4096 MB, V4.52) 14 Logical
Drive(s) - Operation Failed
 - 1719-Slot 3 Drive Array - A controller failure event occurred prior
   to this power-up.  (Previous lock up code = 0x13) Action: Install the
   latest controller firmware. If the problem persists, replace the
   controller.
Redundant ROM Detected - This system contains a valid backup system ROM
HPE SmartMemory authenticated in all populated DIMM slots.

329-Power Management Controller FW Error - Unable to communicate with the FW.
Action: Reset iLO FW. If issue persists reset Power Management Controller FW
(remove AC). If issue persists attempt to update the FW.
312-HPE Smart Storage Battery 1 Failure - Communication with the battery
failed. Its output may not be enabled.
Action: Verify battery is properly installed. Refer to user guide. Contact HPE
support if condition persists.
333-HPE RESTful API Error - Unable to communicate with iLO FW. BIOS
configuration resources may not be up-to-date.
Action: Reset iLO FW and reboot the server. If issue persists, AC power cycle
the server.
fgiunchedi assigned this task to Papaul. Mon, Oct 12, 8:20 AM
fgiunchedi added a subscriber: Papaul.

@Papaul I've updated the hw raid firmware to 6.88 and rebooted to apply the upgrade. On reboot the message below was still there; what do you think? Feel free to upgrade the iLO firmware too; the host can be powered down (from Linux) at any time. Thank you!

Starting drivers. Please wait, this may take a few moments....
333-HPE RESTful API Error - Unable to communicate with iLO FW. BIOS
configuration resources may not be up-to-date.
Action: Reset iLO FW and reboot the server. If issue persists, AC power cycle
the server.
Joe added a subscriber: Joe. Mon, Oct 12, 2:57 PM

I just want to note that this server's root filesystem filled up today, and it's in a strange state: du -xsh / finds only 13 GB, but df reports 53 GB used on /dev/md0. Given there are no huge deleted files that I can see, it seems possible the server has leftover data under /srv on the root partition that is now hidden by the mountpoints.

Thanks! Indeed that's what happened. I've unmounted the filesystems and deleted the files.
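
The du/df comparison above can be reproduced in a few lines. This is a rough sketch under stated assumptions: the helper names (`du_one_filesystem`, `df_used`, `hidden_bytes`) are hypothetical, hard-linked files are counted once per link, and unreadable entries are skipped. A large positive gap only hints that data is not reachable by walking the tree (files hidden under a mountpoint, or deleted files that are still open); it does not say which.

```python
import os


def du_one_filesystem(root):
    """Sum allocated bytes under root without crossing mountpoints (like du -x)."""
    root_stat = os.lstat(root)
    total = root_stat.st_blocks * 512
    for dirpath, dirnames, filenames in os.walk(root):
        same_device = []
        for name in dirnames:
            try:
                if os.lstat(os.path.join(dirpath, name)).st_dev == root_stat.st_dev:
                    same_device.append(name)
            except OSError:
                pass  # entry vanished or is unreadable; skip it
        # Prune in place so os.walk does not descend into other filesystems.
        dirnames[:] = same_device
        for name in same_device + filenames:
            try:
                total += os.lstat(os.path.join(dirpath, name)).st_blocks * 512
            except OSError:
                pass
    return total


def df_used(path):
    """Bytes in use on the filesystem holding path, as the kernel reports them."""
    st = os.statvfs(path)
    return (st.f_blocks - st.f_bfree) * st.f_frsize


def hidden_bytes(path="/"):
    """Space the filesystem reports as used but a tree walk cannot account for."""
    return df_used(path) - du_one_filesystem(path)
```

Calling `hidden_bytes("/")` as root on a host in the state described above would be expected to return a large value, since the files under the mountpoints are invisible to the walk until the filesystems are unmounted.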

Also the system has been sending root mails since ~15:10.

Cron <swift@ms-be2036>   test -x /usr/bin/swift-recon-cron && test -r /etc/swift/object-server.conf && /usr/bin/swift-recon-cron /etc/swift/object-server.conf
[Errno 13] Permission denied: '/var/lock/swift-recon-object-cron'

Looks like /var/lock has wrong permissions:

  File: /var/lock/
  Size: 80              Blocks: 0          IO Block: 4096   directory
Device: 13h/19d Inode: 10998       Links: 3
Access: (0755/drwxr-xr-x)  Uid: (    0/    root)   Gid: (    0/    root)
Access: 2020-10-12 16:43:23.932934567 +0000
Modify: 2020-10-12 15:07:05.734375123 +0000
Change: 2020-10-12 15:07:05.734375123 +0000
 Birth: -

Mentioned in SAL (#wikimedia-operations) [2020-10-12T17:03:03Z] <jayme> fixed /var/lock/ permission (1777) on ms-be2036 - T265208

I could not find any evidence that this was an intentional change, so I fixed the permissions.
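
A quick way to verify the mode after a fix like this is sketched below. It only compares the permission bits (including the sticky bit) against an expected value; the function name `mode_matches` is made up for illustration, and 1777 is taken from the SAL entry above.

```python
import os
import stat


def mode_matches(path, expected=0o1777):
    """True when path's permission bits (including sticky) equal expected."""
    return stat.S_IMODE(os.stat(path).st_mode) == expected


if __name__ == "__main__":
    # On the host in this task this printed state would have been False
    # before the fix (the stat output above shows 0755).
    print(mode_matches("/var/lock"))
```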

I spotted a problem with sdd in dmesg too; perhaps that disk isn't healthy:

[100075.068371] sd 0:1:0:3: [sdd] tag#13 FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE
[100075.068375] sd 0:1:0:3: [sdd] tag#13 Sense Key : Medium Error [current] 
[100075.068377] sd 0:1:0:3: [sdd] tag#13 Add. Sense: Unrecovered read error
[100075.068380] sd 0:1:0:3: [sdd] tag#13 CDB: Read(16) 88 00 00 00 00 00 00 56 8a 30 00 00 00 10 00 00
[100075.068382] blk_update_request: critical medium error, dev sdd, sector 5671472
[100075.101521] XFS (sdd1): metadata I/O error: block 0x568230 ("xfs_trans_read_buf_map") error 61 numblks 16
[100075.145107] XFS (sdd1): xfs_imap_to_bp: xfs_trans_read_buf() returned error -61.
[100075.469664] sd 0:1:0:3: [sdd] tag#9 FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE
[100075.469667] sd 0:1:0:3: [sdd] tag#9 Sense Key : Medium Error [current] 
[100075.469670] sd 0:1:0:3: [sdd] tag#9 Add. Sense: Unrecovered read error
[100075.469673] sd 0:1:0:3: [sdd] tag#9 CDB: Read(16) 88 00 00 00 00 00 00 56 8a 30 00 00 00 10 00 00
[100075.469675] blk_update_request: critical medium error, dev sdd, sector 5671472
[100075.503518] XFS (sdd1): metadata I/O error: block 0x568230 ("xfs_trans_read_buf_map") error 61 numblks 16
[100075.546992] XFS (sdd1): xfs_imap_to_bp: xfs_trans_read_buf() returned error -61.
[100075.547279] sd 0:1:0:3: [sdd] tag#20 FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE
[100075.547282] sd 0:1:0:3: [sdd] tag#20 Sense Key : Medium Error [current] 
[100075.547285] sd 0:1:0:3: [sdd] tag#20 Add. Sense: Unrecovered read error
[100075.547288] sd 0:1:0:3: [sdd] tag#20 CDB: Read(16) 88 00 00 00 00 00 00 56 8a 30 00 00 00 10 00 00
[100075.547289] blk_update_request: critical medium error, dev sdd, sector 5671472
[100075.580289] XFS (sdd1): metadata I/O error: block 0x568230 ("xfs_trans_read_buf_map") error 61 numblks 16
[100075.623913] XFS (sdd1): xfs_imap_to_bp: xfs_trans_read_buf() returned error -61.
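
The CDB bytes in the log above carry the same information as the "sector" and "numblks" fields. The sketch below decodes a READ(16) command (opcode 0x88, an 8-byte big-endian logical block address in bytes 2-9, a 4-byte transfer length in bytes 10-13); the function name `decode_read16` is made up for illustration. The 0x800 offset between the disk sector and the XFS block number would be consistent with sdd1 starting at sector 2048, but that partition offset is an inference and is not stated anywhere in this task.

```python
def decode_read16(cdb_hex):
    """Return (lba, transfer_length) from a READ(16) CDB given as hex bytes."""
    cdb = bytes.fromhex(cdb_hex)
    if len(cdb) != 16 or cdb[0] != 0x88:
        raise ValueError("not a READ(16) CDB")
    lba = int.from_bytes(cdb[2:10], "big")      # bytes 2-9: logical block address
    length = int.from_bytes(cdb[10:14], "big")  # bytes 10-13: number of blocks
    return lba, length


# CDB copied from the sdd errors above.
lba, length = decode_read16("88 00 00 00 00 00 00 56 8a 30 00 00 00 10 00 00")
print(lba, length)          # 5671472 16
print(hex(lba - 0x568230))  # 0x800
```

The decoded address matches "sector 5671472" and the length matches "numblks 16", so all three error groups are retries of the same 16-block read.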
Papaul triaged this task as Medium priority. Tue, Oct 13, 12:43 PM

Plus these recurring USB messages in dmesg:

[105275.802560] usb 3-3: USB disconnect, device number 13
[105276.254453] usb 3-3: new high-speed USB device number 14 using xhci_hcd
[105276.394569] usb 3-3: New USB device found, idVendor=0424, idProduct=2660
[105276.394571] usb 3-3: New USB device strings: Mfr=0, Product=0, SerialNumber=0
[105276.394876] hub 3-3:1.0: USB hub found
[105276.394932] hub 3-3:1.0: 2 ports detected
[105326.147616] usb 3-3: USB disconnect, device number 14
[105326.599559] usb 3-3: new high-speed USB device number 15 using xhci_hcd
[105326.743669] usb 3-3: New USB device found, idVendor=0424, idProduct=2660
[105326.743672] usb 3-3: New USB device strings: Mfr=0, Product=0, SerialNumber=0
[105326.743937] hub 3-3:1.0: USB hub found
[105326.743992] hub 3-3:1.0: 2 ports detected
[105391.503873] usb 3-3: USB disconnect, device number 15
[105391.951809] usb 3-3: new high-speed USB device number 16 using xhci_hcd
[105392.091909] usb 3-3: New USB device found, idVendor=0424, idProduct=2660
[105392.091911] usb 3-3: New USB device strings: Mfr=0, Product=0, SerialNumber=0
[105392.092205] hub 3-3:1.0: USB hub found
[105392.092265] hub 3-3:1.0: 2 ports detected
[105441.844980] usb 3-3: USB disconnect, device number 16
[105442.312926] usb 3-3: new high-speed USB device number 17 using xhci_hcd
[105442.464995] usb 3-3: New USB device found, idVendor=0424, idProduct=2660
[105442.464997] usb 3-3: New USB device strings: Mfr=0, Product=0, SerialNumber=0
[105442.465286] hub 3-3:1.0: USB hub found
[105442.465341] hub 3-3:1.0: 2 ports detected
[105492.198086] usb 3-3: USB disconnect, device number 17
[105492.646027] usb 3-3: new high-speed USB device number 18 using xhci_hcd
[105492.786119] usb 3-3: New USB device found, idVendor=0424, idProduct=2660
[105492.786121] usb 3-3: New USB device strings: Mfr=0, Product=0, SerialNumber=0
[105492.786389] hub 3-3:1.0: USB hub found
[105492.786448] hub 3-3:1.0: 2 ports detected
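
To put a number on "recurring", the gaps between disconnect events can be computed from the dmesg timestamps. This is a small sketch that assumes the bracketed seconds-since-boot timestamp format shown above; `disconnect_intervals` is a hypothetical helper name.

```python
import re


def disconnect_intervals(dmesg_lines, device="usb 3-3"):
    """Seconds between consecutive 'USB disconnect' events for one device."""
    pattern = re.compile(
        r"^\[\s*(\d+\.\d+)\]\s+" + re.escape(device) + r": USB disconnect"
    )
    stamps = [float(m.group(1)) for m in map(pattern.match, dmesg_lines) if m]
    return [round(later - earlier, 3) for earlier, later in zip(stamps, stamps[1:])]


# Lines copied from the excerpt above.
usb_sample = [
    "[105275.802560] usb 3-3: USB disconnect, device number 13",
    "[105276.254453] usb 3-3: new high-speed USB device number 14 using xhci_hcd",
    "[105326.147616] usb 3-3: USB disconnect, device number 14",
    "[105391.503873] usb 3-3: USB disconnect, device number 15",
]

print(disconnect_intervals(usb_sample))  # gaps of roughly 50 s and 65 s
```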

@fgiunchedi

Embedded Flash/SD-CARD: Controller firmware revision 2.10.00, Embedded media manager failed media attach

Upgraded iLO from 2.50 to 2.74

The Flash/SD-CARD problem was fixed by formatting the NAND and draining the power

Embedded Flash/SD-CARD: Controller firmware revision 2.10.00

@fgiunchedi looks like icinga is happy now

MD RAID   OK   2020-10-13 17:51:33   0d 0h 3m 51s   1/3   OK: Active: 4, Working: 4, Failed: 0, Spare: 0
fgiunchedi added a comment. Edited · Wed, Oct 14, 7:48 AM

Thanks @Papaul! It looks like the sdd disk is in trouble. Do you have a spare, or can you order one? Thank you!

The disk is physicaldrive 1I:1:6 (port 1I:box 1:bay 6, SATA, 4000.7 GB, OK)

[45698.678655] sd 0:1:0:6: [sdg] tag#31 FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE
[45698.678658] sd 0:1:0:6: [sdg] tag#31 Sense Key : Medium Error [current] 
[45698.678661] sd 0:1:0:6: [sdg] tag#31 Add. Sense: Unrecovered read error
[45698.678664] sd 0:1:0:6: [sdg] tag#31 CDB: Read(16) 88 00 00 00 00 00 26 02 b8 00 00 00 01 00 00 00
[45698.678666] blk_update_request: critical medium error, dev sdg, sector 637712384
[45708.505945] hpsa 0000:08:00.0: scsi 0:1:0:6: resetting logical  Direct-Access     HP       LOGICAL VOLUME   RAID-0 SSDSmartPathCap- En- Exp=1
[45714.923894] hpsa 0000:08:00.0: scsi 0:1:0:6: reset logical  completed successfully Direct-Access     HP       LOGICAL VOLUME   RAID-0 SSDSmartPathCap- En- Exp=1
[45714.929785] sd 0:1:0:6: [sdg] tag#24 FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE
[45714.929790] sd 0:1:0:6: [sdg] tag#24 Sense Key : Medium Error [current] 
[45714.929794] sd 0:1:0:6: [sdg] tag#24 Add. Sense: Unrecovered read error
[45714.929800] sd 0:1:0:6: [sdg] tag#24 CDB: Read(16) 88 00 00 00 00 00 26 02 b9 00 00 00 01 00 00 00
[45714.929803] blk_update_request: critical medium error, dev sdg, sector 637712640
[47966.681844] sd 0:1:0:3: [sdd] tag#17 FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE
[47966.681848] sd 0:1:0:3: [sdd] tag#17 Sense Key : Medium Error [current] 
[47966.681850] sd 0:1:0:3: [sdd] tag#17 Add. Sense: Unrecovered read error
[47966.681853] sd 0:1:0:3: [sdd] tag#17 CDB: Read(16) 88 00 00 00 00 00 00 56 8a 30 00 00 00 10 00 00
[47966.681855] blk_update_request: critical medium error, dev sdd, sector 5671472
[47966.714762] XFS (sdd1): metadata I/O error: block 0x568230 ("xfs_trans_read_buf_map") error 61 numblks 16
[47966.758825] XFS (sdd1): xfs_imap_to_bp: xfs_trans_read_buf() returned error -61.
[47967.293215] sd 0:1:0:3: [sdd] tag#23 FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE
[47967.293218] sd 0:1:0:3: [sdd] tag#23 Sense Key : Medium Error [current] 
[47967.293220] sd 0:1:0:3: [sdd] tag#23 Add. Sense: Unrecovered read error
[47967.293223] sd 0:1:0:3: [sdd] tag#23 CDB: Read(16) 88 00 00 00 00 00 00 56 8a 30 00 00 00 10 00 00
[47967.293225] blk_update_request: critical medium error, dev sdd, sector 5671472
[47967.326714] XFS (sdd1): metadata I/O error: block 0x568230 ("xfs_trans_read_buf_map") error 61 numblks 16
[47967.370174] XFS (sdd1): xfs_imap_to_bp: xfs_trans_read_buf() returned error -61.
[47967.382388] sd 0:1:0:3: [sdd] tag#23 FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE
[47967.382391] sd 0:1:0:3: [sdd] tag#23 Sense Key : Medium Error [current] 
[47967.382393] sd 0:1:0:3: [sdd] tag#23 Add. Sense: Unrecovered read error
[47967.382396] sd 0:1:0:3: [sdd] tag#23 CDB: Read(16) 88 00 00 00 00 00 00 56 8a 30 00 00 00 10 00 00
[47967.382398] blk_update_request: critical medium error, dev sdd, sector 5671472
[47967.415672] XFS (sdd1): metadata I/O error: block 0x568230 ("xfs_trans_read_buf_map") error 61 numblks 16
[47967.460865] XFS (sdd1): xfs_imap_to_bp: xfs_trans_read_buf() returned error -61.
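
Since the excerpt above mixes errors from more than one device, a per-device tally of the failing sectors makes it easier to read. The sketch below only looks at the `blk_update_request: critical medium error` lines; the helper name `failing_sectors` is made up for illustration, and the sample lines are copied from the log above.

```python
import re
from collections import defaultdict

MEDIUM_ERROR = re.compile(r"critical medium error, dev (\w+), sector (\d+)")


def failing_sectors(dmesg_lines):
    """Map each block device to the sorted set of sectors with medium errors."""
    sectors = defaultdict(set)
    for line in dmesg_lines:
        match = MEDIUM_ERROR.search(line)
        if match:
            sectors[match.group(1)].add(int(match.group(2)))
    return {dev: sorted(found) for dev, found in sectors.items()}


error_sample = [
    "[45698.678666] blk_update_request: critical medium error, dev sdg, sector 637712384",
    "[45714.929803] blk_update_request: critical medium error, dev sdg, sector 637712640",
    "[47966.681855] blk_update_request: critical medium error, dev sdd, sector 5671472",
    "[47967.293225] blk_update_request: critical medium error, dev sdd, sector 5671472",
]

print(failing_sectors(error_sample))
# {'sdg': [637712384, 637712640], 'sdd': [5671472]}
```

In this excerpt sdd keeps failing on a single sector, while sdg reports two different sectors.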

@fgiunchedi yes we do have 3 spares. Will replace when back on site tomorrow.

Papaul reassigned this task from Papaul to fgiunchedi. Thu, Oct 15, 2:48 PM

@fgiunchedi disk replaced

fgiunchedi closed this task as Resolved. Thu, Oct 15, 3:01 PM

Disk is rebuilding! Thank you @Papaul