
Urgent: Two failed disks in ms-be2040
Closed, Resolved (Public)

Description

Hi,

Two drives have failed in ms-be2040 - /dev/sdh and /dev/sdn. Can they be replaced ASAP, please?

/dev/sdh
lshw -C disk tells us nothing as the device is AWOL, but scsi@0:2.8.0 is absent, so it must be that one.
Similarly Target Id: 8 and Virtual Drive: 8 are absent from megacli -ldpdinfo -a0 output, as is Slot Number: 6, so I infer that slot contains the failed drive; I've attempted to turn on the locator on that device, but it failed:

mvernon@ms-be2040:~$ sudo megacli -PDLocate -PhysDrv [32:6] -a0
                                     
Adapter 0: Device at Enclosure - 32, Slot - 6 is not found.

Exit Code: 0x01

/dev/sdn
lshw -C disk tells us scsi@0:2.13.0
megacli -ldpdinfo -a0 tells us Target Id: 13 is associated with
Enclosure Device ID: 32, Slot Number: 11
I've endeavoured to turn on the locator light thus:
sudo megacli -PDLocate -PhysDrv [32:11] -a0
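
The lookup described above (lshw bus info, then Target Id, then enclosure and slot) can be sketched as a small script. This is only an illustration: it assumes the third number in the lshw `scsi@0:2.13.0` string is the target ID (which matches both drives in this task), and it relies only on the `Target Id:`, `Enclosure Device ID:` and `Slot Number:` labels quoted above. The `SAMPLE` text is a hypothetical, heavily abridged fragment, not real output from ms-be2040, and the function names are made up for this sketch.

```python
import re

def scsi_target(lshw_bus_info: str) -> int:
    """Return the target ID from an lshw bus info string like 'scsi@0:2.13.0'."""
    m = re.fullmatch(r"scsi@(\d+):(\d+)\.(\d+)\.(\d+)", lshw_bus_info)
    if m is None:
        raise ValueError(f"unrecognised bus info: {lshw_bus_info!r}")
    return int(m.group(3))  # fields are host:channel.target.lun

def target_to_slots(ldpdinfo_text: str) -> dict:
    """Map each Target Id to the (enclosure, slot) pairs listed beneath it."""
    mapping = {}
    current = None    # Target Id of the virtual drive being read
    enclosure = None  # most recent Enclosure Device ID seen for it
    for line in ldpdinfo_text.splitlines():
        if (m := re.search(r"Target Id:\s*(\d+)", line)):
            current = int(m.group(1))
            mapping[current] = []
            enclosure = None
        elif (m := re.match(r"\s*Enclosure Device ID:\s*(\d+)", line)):
            enclosure = int(m.group(1))
        elif (m := re.match(r"\s*Slot Number:\s*(\d+)", line)):
            if current is not None and enclosure is not None:
                mapping[current].append((enclosure, int(m.group(1))))
    return mapping

# Hypothetical, abridged sample using only the labels quoted in this task.
SAMPLE = """\
Virtual Drive: 13 (Target Id: 13)
Enclosure Device ID: 32
Slot Number: 11
"""

if __name__ == "__main__":
    target = scsi_target("scsi@0:2.13.0")
    for enclosure, slot in target_to_slots(SAMPLE).get(target, []):
        # Prints the locator argument in the [enclosure:slot] form used above.
        print(f"[{enclosure}:{slot}]")
```

A target that is missing from the mapping (as Target Id 8 is for /dev/sdh) simply yields no output, which is the same signal used above to infer the failed slot.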

Event Timeline

KOfori renamed this task from Two failed disks in ms-be2040 to Urgent: Two failed disks in ms-be2040. Mon, Mar 13, 2:26 PM
wiki_willy added subscribers: Jclark-ctr, RobH, wiki_willy.

Hi @Papaul - just a heads up that this one is out of warranty, but @RobH is working on purchasing more spares after the testing in T329305. Thanks, Willy

@wiki_willy thank you for the heads up. @MatthewVernon I checked the system, and it looks like it is seeing all 14 disks. Is it possible that a reboot or a re-image could fix the issue? Thank you

Physical Disk 0:1:0	Online	0	3725.50 GB	Not Capable	SATA	HDD	No	Not Applicable
Physical Disk 0:1:1	Online	1	3725.50 GB	Not Capable	SATA	HDD	No	Not Applicable
Physical Disk 0:1:2	Online	2	3725.50 GB	Not Capable	SATA	HDD	No	Not Applicable
Physical Disk 0:1:3	Online	3	3725.50 GB	Not Capable	SATA	HDD	No	Not Applicable
Physical Disk 0:1:4	Online	4	3725.50 GB	Not Capable	SATA	HDD	No	Not Applicable
Physical Disk 0:1:5	Online	5	3725.50 GB	Not Capable	SATA	HDD	No	Not Applicable
Physical Disk 0:1:7	Online	7	3725.50 GB	Not Capable	SATA	HDD	No	Not Applicable
Physical Disk 0:1:8	Online	8	3725.50 GB	Not Capable	SATA	HDD	No	Not Applicable
Physical Disk 0:1:9	Online	9	3725.50 GB	Not Capable	SATA	HDD	No	Not Applicable
Physical Disk 0:1:10	Online	10	3725.50 GB	Not Capable	SATA	HDD	No	Not Applicable
Physical Disk 0:1:11	Online	11	3725.50 GB	Not Capable	SATA	HDD	No	Not Applicable
Solid State Disk 0:1:12	Non-RAID	12	446.63 GB	Encryption Capable	SATA	SSD	No	99%
Solid State Disk 0:1:13	Non-RAID	13	446.63 GB	Encryption Capable	SATA	SSD	No	99%

@Papaul that's only 13 disks, not 14?
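
The count can be checked mechanically rather than by eye. The sketch below takes the disk names from the listing above and reports which slot numbers are absent. It assumes the chassis has 14 bays numbered 0 to 13 and that the last number in each `0:1:N` name is the slot; the function names are hypothetical.

```python
import re

def present_slots(listing: str) -> set:
    """Collect the slot numbers (last field of '0:1:N') of all listed disks."""
    pattern = r"(?:Physical|Solid State) Disk \d+:\d+:(\d+)"
    return {int(m.group(1)) for m in re.finditer(pattern, listing)}

def missing_slots(listing: str, expected_count: int) -> list:
    """Return slot numbers in 0..expected_count-1 that have no listed disk."""
    return sorted(set(range(expected_count)) - present_slots(listing))

# Disk names as they appear in the listing above (other columns omitted).
LISTING = "\n".join(
    [f"Physical Disk 0:1:{n}" for n in (0, 1, 2, 3, 4, 5, 7, 8, 9, 10, 11)]
    + ["Solid State Disk 0:1:12", "Solid State Disk 0:1:13"]
)

if __name__ == "__main__":
    print(len(present_slots(LISTING)), "disks listed")
    print("missing slots:", missing_slots(LISTING, expected_count=14))
```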
The recent activity panel in the iDRAC shows:

2023-03-12T15:08:42-0500	Virtual Disk 8 on Integrated RAID Controller 1 was deleted.
2023-03-12T15:08:37-0500	The Patrol Read operation was stopped and did not complete for Integrated RAID Controller 1.
2023-03-12T15:08:37-0500	Controller cache is preserved for missing or offline Virtual Disk 8 on Integrated RAID Controller 1.
2023-03-12T15:08:37-0500	Disk 6 in Backplane 1 of Integrated RAID Controller 1 is removed.
2023-03-12T15:08:37-0500	Disk 6 in Backplane 1 of Integrated RAID Controller 1 was reset.
2023-03-12T15:08:37-0500	Disk 6 in Backplane 1 of Integrated RAID Controller 1 is not functioning correctly. 
2023-03-12T15:08:26-0500	Disk 6 in Backplane 1 of Integrated RAID Controller 1 was reset.

I tried rebooting, but now the system won't boot at all, saying:

There are offline or missing virtual drives with preserved cache.
Please check the cables and ensure all drives are present.
Press any key to enter the configuration utility.

I set the Controller BIOS mode to "Pause on Errors" (rather than "Stop on Errors"). That didn't help, so I set it to "Ignore Errors", which is obviously not ideal.

After the reboot, I remounted sdn, although I expect it will fail again in the near future. Here are some kernel logs from before I unmounted it at the weekend:

Mar 12 15:13:19 ms-be2040 kernel: [12276007.787416] XFS (sdn1): metadata I/O error in "xfs_da_read_buf+0xd9/0x130 [xfs]" at daddr 0x32419180 len 8 error 5
Mar 12 15:13:19 ms-be2040 kernel: [12276007.794615] XFS (sdn1): metadata I/O error in "xfs_da_read_buf+0xd9/0x130 [xfs]" at daddr 0x6fbf3c8 len 8 error 5
Mar 12 15:13:19 ms-be2040 kernel: [12276007.799525] XFS (sdn1): metadata I/O error in "xfs_da_read_buf+0xd9/0x130 [xfs]" at daddr 0x32419180 len 8 error 5
Mar 12 15:13:19 ms-be2040 kernel: [12276007.811714] XFS (sdn1): metadata I/O error in "xfs_da_read_buf+0xd9/0x130 [xfs]" at daddr 0x111f268e8 len 8 error 5
Mar 12 15:13:19 ms-be2040 kernel: [12276007.830357] XFS (sdn1): metadata I/O error in "xfs_da_read_buf+0xd9/0x130 [xfs]" at daddr 0xce17130 len 8 error 5
Mar 12 15:13:19 ms-be2040 kernel: [12276007.835430] XFS (sdn1): metadata I/O error in "xfs_da_read_buf+0xd9/0x130 [xfs]" at daddr 0x111f268e8 len 8 error 5
Mar 12 15:13:19 ms-be2040 kernel: [12276007.846999] XFS (sdn1): metadata I/O error in "xfs_da_read_buf+0xd9/0x130 [xfs]" at daddr 0xce17130 len 8 error 5
Mar 12 15:13:19 ms-be2040 kernel: [12276007.847100] XFS (sdn1): metadata I/O error in "xfs_da_read_buf+0xd9/0x130 [xfs]" at daddr 0x75a09a50 len 8 error 5
Mar 12 15:13:19 ms-be2040 kernel: [12276007.851507] XFS (sdn1): metadata I/O error in "xfs_da_read_buf+0xd9/0x130 [xfs]" at daddr 0x1aa610708 len 8 error 5
Mar 12 15:13:19 ms-be2040 kernel: [12276007.851627] XFS (sdn1): metadata I/O error in "xfs_da_read_buf+0xd9/0x130 [xfs]" at daddr 0x1aa610708 len 8 error 5
Mar 12 15:13:19 ms-be2040 kernel: [12276007.852170] XFS (sdn1): log I/O error -5
Mar 12 15:13:19 ms-be2040 kernel: [12276007.852307] XFS (sdn1): xfs_do_force_shutdown(0x2) called from line 1211 of file fs/xfs/xfs_log.c. Return address = 00000000d359fe17
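
Logs like these are easier to reason about once the repeats are collapsed. The sketch below groups the metadata I/O errors by device and disk address and translates the error number (5 is EIO). The regular expression is written against the line format shown above and is an assumption about how other kernel versions format the message; `summarize` is a hypothetical helper name.

```python
import errno
import re
from collections import Counter

# Matches the 'metadata I/O error' lines in the format shown above.
PATTERN = re.compile(
    r'XFS \((?P<dev>\w+)\): metadata I/O error in "(?P<func>[^"]+)" '
    r"at daddr (?P<daddr>0x[0-9a-f]+) len (?P<len>\d+) error (?P<err>\d+)"
)

def summarize(log_text: str) -> Counter:
    """Count errors per (device, daddr, errno name)."""
    counts = Counter()
    for m in PATTERN.finditer(log_text):
        code = int(m["err"])
        name = errno.errorcode.get(code, str(code))  # e.g. 5 -> 'EIO'
        counts[(m["dev"], m["daddr"], name)] += 1
    return counts

# First three log lines from above, copied verbatim.
SAMPLE = """\
Mar 12 15:13:19 ms-be2040 kernel: [12276007.787416] XFS (sdn1): metadata I/O error in "xfs_da_read_buf+0xd9/0x130 [xfs]" at daddr 0x32419180 len 8 error 5
Mar 12 15:13:19 ms-be2040 kernel: [12276007.794615] XFS (sdn1): metadata I/O error in "xfs_da_read_buf+0xd9/0x130 [xfs]" at daddr 0x6fbf3c8 len 8 error 5
Mar 12 15:13:19 ms-be2040 kernel: [12276007.799525] XFS (sdn1): metadata I/O error in "xfs_da_read_buf+0xd9/0x130 [xfs]" at daddr 0x32419180 len 8 error 5
"""

if __name__ == "__main__":
    for (dev, daddr, name), n in summarize(SAMPLE).most_common():
        print(f"{dev} daddr={daddr} {name} x{n}")
```

Several distinct addresses failing within the same second, followed by a log I/O error and a forced shutdown, points at the device rather than at a single bad block.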

So it would be good to replace these two drives, please?

Since this is out of warranty, the pending purchase of 5 disks was raised to 7 on T331988 to accommodate this repair.

@MatthewVernon you're right, I didn't read disk 6. I will see if I can find disks from old ms-be* hosts.

@MatthewVernon I replaced disk 6. Let me know which other one is having issues.

Physical Disk 0:1:0	Online	0	3725.50 GB	Not Capable	SATA	HDD	No	Not Applicable
Physical Disk 0:1:1	Online	1	3725.50 GB	Not Capable	SATA	HDD	No	Not Applicable
Physical Disk 0:1:2	Online	2	3725.50 GB	Not Capable	SATA	HDD	No	Not Applicable
Physical Disk 0:1:3	Online	3	3725.50 GB	Not Capable	SATA	HDD	No	Not Applicable
Physical Disk 0:1:4	Online	4	3725.50 GB	Not Capable	SATA	HDD	No	Not Applicable
Physical Disk 0:1:5	Online	5	3725.50 GB	Not Capable	SATA	HDD	No	Not Applicable
Physical Disk 0:1:6	Ready	6	3725.50 GB	Not Capable	SATA	HDD	No	Not Applicable
Physical Disk 0:1:7	Online	7	3725.50 GB	Not Capable	SATA	HDD	No	Not Applicable
Physical Disk 0:1:8	Online	8	3725.50 GB	Not Capable	SATA	HDD	No	Not Applicable
Physical Disk 0:1:9	Online	9	3725.50 GB	Not Capable	SATA	HDD	No	Not Applicable
Physical Disk 0:1:10	Online	10	3725.50 GB	Not Capable	SATA	HDD	No	Not Applicable
Physical Disk 0:1:11	Online	11	3725.50 GB	Not Capable	SATA	HDD	No	Not Applicable
Solid State Disk 0:1:12	Non-RAID	12	446.63 GB	Encryption Capable	SATA	SSD	No	99%
Solid State Disk 0:1:13	Non-RAID	13	446.63 GB	Encryption Capable	SATA	SSD	No	99%

@Papaul thanks; the other drive has behaved itself since the reboot, so I think we're OK to leave it in place for now.

[obviously it will now fail again tomorrow, but...]