
Degraded RAID on an-worker1235
Closed, Resolved (Public)

Description

TASK AUTO-GENERATED by Nagios/Icinga RAID event handler

A degraded RAID (broadcom) was detected on host an-worker1235. An automatic snapshot of the current RAID status is attached below.

Please sync with the service owner to find the appropriate time window before actually replacing any failed hardware.

communication: 0 OK : controller: 1 Needs Attention : physical_disk: 1 Failed : virtual_disk: 1 OfLn : bbu: 0 OK : enclosure: 0 OK : CLI Version = 007.1910.0000.0000 Oct 08, 2021

$ sudo /usr/local/lib/nagios/plugins/get-raid-status-broadcom
Failed to execute '['/usr/lib/nagios/plugins/check_nrpe', '-4', '-H', 'an-worker1235', '-c', 'get_raid_status_broadcom']': RETCODE: 2
STDOUT:
communication: 0 OK ; controller: 1 Needs Attention ; physical_disk: 1 Failed ; virtual_disk: 1 OfLn ; bbu: 0 OK ; enclosure: 0 OK ; CLI Version = 007.1910.0000.0000 Oct 08, 2021
Operating system = Linux 5.10.0-35-amd64
Controller = 0
Status = Success
Description = Show Drive Group Succeeded


TOPOLOGY :
========

-----------------------------------------------------------------------------
DG Arr Row EID:Slot DID Type  State BT       Size PDC  PI SED DS3  FSpace TR 
-----------------------------------------------------------------------------
 0 -   -   -        -   RAID1 Optl  N  446.625 GB dflt N  N   dflt N      N  
 0 0   -   -        -   RAID1 Optl  N  446.625 GB dflt N  N   dflt N      N  
 0 0   0   64:12    0   DRIVE Onln  N  446.625 GB dflt N  N   dflt -      N  
 0 0   1   64:13    1   DRIVE Onln  N  446.625 GB dflt N  N   dflt -      N  
 1 -   -   -        -   RAID0 Optl  N    7.276 TB dflt N  N   dflt N      N  
 1 0   -   -        -   RAID0 Optl  N    7.276 TB dflt N  N   dflt N      N  
 1 0   0   64:0     3   DRIVE Onln  N    7.276 TB dflt N  N   dflt -      N  
 2 -   -   -        -   RAID0 Optl  N    7.276 TB dflt N  N   dflt N      N  
 2 0   -   -        -   RAID0 Optl  N    7.276 TB dflt N  N   dflt N      N  
 2 0   0   64:1     4   DRIVE Onln  N    7.276 TB dflt N  N   dflt -      N  
 3 -   -   -        -   RAID0 Optl  N    7.276 TB dflt N  N   dflt N      N  
 3 0   -   -        -   RAID0 Optl  N    7.276 TB dflt N  N   dflt N      N  
 3 0   0   64:10    5   DRIVE Onln  N    7.276 TB dflt N  N   dflt -      N  
 4 -   -   -        -   RAID0 OfLn  N    7.276 TB dflt N  N   dflt N      N  
 4 0   -   -        -   RAID0 Dgrd  N    7.276 TB dflt N  N   dflt N      N  
 5 -   -   -        -   RAID0 Optl  N    7.276 TB dflt N  N   dflt N      N  
 5 0   -   -        -   RAID0 Optl  N    7.276 TB dflt N  N   dflt N      N  
 5 0   0   64:3     7   DRIVE Onln  N    7.276 TB dflt N  N   dflt -      N  
 6 -   -   -        -   RAID0 Optl  N    7.276 TB dflt N  N   dflt N      N  
 6 0   -   -        -   RAID0 Optl  N    7.276 TB dflt N  N   dflt N      N  
 6 0   0   64:4     8   DRIVE Onln  N    7.276 TB dflt N  N   dflt -      N  
 7 -   -   -        -   RAID0 Optl  N    7.276 TB dflt N  N   dflt N      N  
 7 0   -   -        -   RAID0 Optl  N    7.276 TB dflt N  N   dflt N      N  
 7 0   0   64:8     10  DRIVE Onln  N    7.276 TB dflt N  N   dflt -      N  
 8 -   -   -        -   RAID0 Optl  N    7.276 TB dflt N  N   dflt N      N  
 8 0   -   -        -   RAID0 Optl  N    7.276 TB dflt N  N   dflt N      N  
 8 0   0   64:6     11  DRIVE Onln  N    7.276 TB dflt N  N   dflt -      N  
 9 -   -   -        -   RAID0 Optl  N    7.276 TB dflt N  N   dflt N      N  
 9 0   -   -        -   RAID0 Optl  N    7.276 TB dflt N  N   dflt N      N  
 9 0   0   64:7     12  DRIVE Onln  N    7.276 TB dflt N  N   dflt -      N  
10 -   -   -        -   RAID0 Optl  N    7.276 TB dflt N  N   dflt N      N  
10 0   -   -        -   RAID0 Optl  N    7.276 TB dflt N  N   dflt N      N  
10 0   0   64:11    13  DRIVE Onln  N    7.276 TB dflt N  N   dflt -      N  
11 -   -   -        -   RAID0 Optl  N    7.276 TB dflt N  N   dflt N      N  
11 0   -   -        -   RAID0 Optl  N    7.276 TB dflt N  N   dflt N      N  
11 0   0   64:9     14  DRIVE Onln  N    7.276 TB dflt N  N   dflt -      N  
12 -   -   -        -   RAID0 Optl  N    7.276 TB dflt N  N   dflt N      N  
12 0   -   -        -   RAID0 Optl  N    7.276 TB dflt N  N   dflt N      N  
12 0   0   64:5     15  DRIVE Onln  N    7.276 TB dflt N  N   dflt -      N  
-----------------------------------------------------------------------------

DG=Disk Group Index|Arr=Array Index|Row=Row Index|EID=Enclosure Device ID
DID=Device ID|Type=Drive Type|Onln=Online|Rbld=Rebuild|Optl=Optimal|Dgrd=Degraded
Pdgd=Partially degraded|Offln=Offline|BT=Background Task Active
PDC=PD Cache|PI=Protection Info|SED=Self Encrypting Drive|Frgn=Foreign
DS3=Dimmer Switch 3|dflt=Default|Msng=Missing|FSpace=Free Space Present
TR=Transport Ready






STDERR:
None
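
The topology table above can be filtered mechanically to show which disk group triggered the alert. The following is a minimal sketch, not part of the Icinga handler; the function name `non_optimal_groups` is hypothetical, and it assumes the column layout shown in the snapshot (DG in column 1, Type in column 6, State in column 7).

```shell
# Print "DG State" for each disk-group summary row whose state is not Optl.
# Reads the TOPOLOGY table on stdin. Summary rows are the ones where the
# Arr column (field 2) is "-" and the Type column (field 6) starts with RAID.
non_optimal_groups() {
    awk '$1 ~ /^[0-9]+$/ && $2 == "-" && $6 ~ /^RAID/ && $7 != "Optl" { print $1, $7 }'
}
```

Fed the snapshot above, this would report disk group 4 as OfLn. That group is also the only one with no DRIVE row beneath it, and slot 64:2 is the only slot between 64:0 and 64:13 that does not appear in the table.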

Event Timeline

Jclark-ctr subscribed.

Confirmed: Service Request 216687198 was successfully submitted.

The service request was denied because the server is out of warranty. I was able to call Dell and open a parts-only claim under the 1-year limited warranty.

I can see from dmesg -T that the drive in question is /dev/sde.
Its filesystem was remounted read-only on Oct 3rd.

[Fri Oct  3 02:12:46 2025] EXT4-fs (sde1): I/O error while writing superblock
[Fri Oct  3 02:12:46 2025] EXT4-fs error (device sde1): ext4_journal_check_start:83: Detected aborted journal
[Fri Oct  3 02:12:46 2025] EXT4-fs (sde1): Remounting filesystem read-only
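
The same lookup can be done without reading the log by eye. This is a rough sketch that assumes the ext4 message wording shown above; `readonly_remounts` is a hypothetical helper name.

```shell
# List the ext4 partitions the kernel has remounted read-only.
# Reads `dmesg -T` style text on stdin and prints each device once.
readonly_remounts() {
    sed -n 's/.*EXT4-fs (\([^)]*\)): Remounting filesystem read-only.*/\1/p' | sort -u
}
```

A typical call would be `sudo dmesg -T | readonly_remounts`.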

I was able to unmount it successfully.

btullis@an-worker1235:~$ sudo umount /dev/sde1
btullis@an-worker1235:~$ echo $?
0
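
As an extra check beyond the exit code, the mount table can be consulted before the drive is pulled. A small sketch, assuming the device is listed under the same path in /proc/mounts; `is_mounted` is a hypothetical name, and the optional second argument exists only so the function can be pointed at a saved copy of the table.

```shell
# Succeed (exit 0) if the given device appears as a mount source.
# $1 = device path, $2 = mount table to read (defaults to /proc/mounts).
is_mounted() {
    awk -v dev="$1" '$1 == dev { found = 1 } END { exit !found }' "${2:-/proc/mounts}"
}
```

For example, `is_mounted /dev/sde1 || echo "safe to swap"`.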

So you can go ahead and swap this drive whenever it is convenient, thanks @Jclark-ctr.

@BTullis The failed drive has been replaced. Thanks for the assistance!

BTullis claimed this task.

Reopening and assigning to myself, because there is still a manual step to do here. I hope this doesn't mess up your team's SLO targets.

I checked the physical disks to see which one needed to be configured:

PD LIST :
=======

----------------------------------------------------------------------------------
EID:Slt DID State DG       Size Intf Med SED PI SeSz Model                Sp Type 
----------------------------------------------------------------------------------
64:0      3 Onln   1   7.276 TB SATA HDD N   N  512B TOSHIBA MG06ACA800EY U  -    
64:1      4 Onln   2   7.276 TB SATA HDD N   N  512B TOSHIBA MG06ACA800EY U  -    
64:2     16 UGood  -   7.276 TB SATA HDD N   N  512B ST8000NM023B-2TJ133  U  -    
64:3      7 Onln   4   7.276 TB SATA HDD N   N  512B TOSHIBA MG06ACA800EY U  -    
64:4      8 Onln   5   7.276 TB SATA HDD N   N  512B TOSHIBA MG06ACA800EY U  -    
64:5     15 Onln  11   7.276 TB SATA HDD N   N  512B TOSHIBA MG06ACA800EY U  -    
64:6     11 Onln   7   7.276 TB SATA HDD N   N  512B TOSHIBA MG06ACA800EY U  -    
64:7     12 Onln   8   7.276 TB SATA HDD N   N  512B TOSHIBA MG06ACA800EY U  -    
64:8     10 Onln   6   7.276 TB SATA HDD N   N  512B TOSHIBA MG06ACA800EY U  -    
64:9     14 Onln  10   7.276 TB SATA HDD N   N  512B ST8000NM023B-2TJ133  U  -    
64:10     5 Onln   3   7.276 TB SATA HDD N   N  512B TOSHIBA MG06ACA800EY U  -    
64:11    13 Onln   9   7.276 TB SATA HDD N   N  512B TOSHIBA MG06ACA800EY U  -    
64:12     0 Onln   0 446.625 GB SATA SSD N   N  512B HFS480G3H2X069N      U  -    
64:13     1 Onln   0 446.625 GB SATA SSD N   N  512B HFS480G3H2X069N      U  -    
----------------------------------------------------------------------------------
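
The drive that still needs configuring is the one in state UGood (unconfigured good) with no disk group. A minimal sketch for picking it out of the PD LIST table, assuming the column order shown above; `unconfigured_good` is a hypothetical name.

```shell
# Print the EID:Slt of every physical disk whose State column is UGood.
# Reads the PD LIST table on stdin.
unconfigured_good() {
    awk '$1 ~ /^[0-9]+:[0-9]+$/ && $3 == "UGood" { print $1 }'
}
```

On the table above this would print 64:2, the slot of the replacement drive.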

I showed and then deleted the preserved cache for the missing drive.

btullis@an-worker1235:~$ sudo perccli64 /c0 show preservedcache
CLI Version = 007.1910.0000.0000 Oct 08, 2021
Operating system = Linux 5.10.0-35-amd64
Controller = 0
Status = Success
Description = None


-----------------
 VD Size State   
-----------------
235    - Missing 
-----------------


btullis@an-worker1235:~$ sudo perccli64 /c0/v235 delete preservedcache
CLI Version = 007.1910.0000.0000 Oct 08, 2021
Operating system = Linux 5.10.0-35-amd64
Controller = 0
Status = Success
Description = Virtual Drive preserved Cache Data Cleared.
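
The task does not say why this step was needed. My understanding is that controllers of this family can refuse to create a new virtual disk while preserved cache for a missing one is still held, so treat that as an assumption. The sketch below only extracts the virtual disk IDs from the `show preservedcache` table so that each can be reviewed before deleting; `preserved_cache_vds` is a hypothetical name.

```shell
# Print the VD number of each row in the preserved-cache table.
# Reads `perccli64 /c0 show preservedcache` output on stdin; table rows
# have exactly three fields and start with a numeric VD id.
preserved_cache_vds() {
    awk 'NF == 3 && $1 ~ /^[0-9]+$/ { print $1 }'
}
```

Each ID printed corresponds to a `/c0/v<ID>` target like the `/c0/v235` used above.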

Then I created a new single-disk RAID0 volume on the replacement disk in slot 64:2.

btullis@an-worker1235:~$ sudo perccli64 /c0 add vd r0 drives='64:2'
CLI Version = 007.1910.0000.0000 Oct 08, 2021
Operating system = Linux 5.10.0-35-amd64
Controller = 0
Status = Success
Description = Add VD Succeeded.

I verified that it was presented to the operating system. It showed up as /dev/sde on this occasion.

btullis@an-worker1235:~$ sudo dmesg -T|tail
[Fri Oct 10 09:06:26 2025] scsi 0:3:107:0: Direct-Access     DELL     PERC H750 Adp    5.16 PQ: 0 ANSI: 5
[Fri Oct 10 09:06:26 2025] sd 0:3:107:0: Attached scsi generic sg4 type 0
[Fri Oct 10 09:06:26 2025] sd 0:3:107:0: [sde] 15626928128 512-byte logical blocks: (8.00 TB/7.28 TiB)
[Fri Oct 10 09:06:26 2025] sd 0:3:107:0: [sde] Write Protect is off
[Fri Oct 10 09:06:26 2025] sd 0:3:107:0: [sde] Mode Sense: 1f 00 00 08
[Fri Oct 10 09:06:26 2025] sd 0:3:107:0: [sde] Write cache: disabled, read cache: enabled, doesn't support DPO or FUA
[Fri Oct 10 09:06:26 2025] sd 0:3:107:0: [sde] Optimal transfer size 262144 bytes
[Fri Oct 10 09:06:26 2025] sde: detected capacity change from 0 to 8000987201536
[Fri Oct 10 09:06:26 2025] sde: detected capacity change from 0 to 8000987201536
[Fri Oct 10 09:06:26 2025] sd 0:3:107:0: [sde] Attached SCSI disk
btullis@an-worker1235:~$

I then partitioned the new disk, created an ext4 file system labelled hadoop-e on it, and set the reserved block percentage to 0.

btullis@an-worker1235:~$ sudo parted /dev/sde --script mklabel gpt
btullis@an-worker1235:~$ sudo parted /dev/sde --script mkpart primary ext4 0% 100%
btullis@an-worker1235:~$ sudo mkfs.ext4 -L hadoop-e /dev/sde1
mke2fs 1.46.2 (28-Feb-2021)
Creating filesystem with 1953365504 4k blocks and 244170752 inodes
Filesystem UUID: 47e09ff6-5a38-4b4f-a5d5-0421823dc8f3
Superblock backups stored on blocks: 
	32768, 98304, 163840, 229376, 294912, 819200, 884736, 1605632, 2654208, 
	4096000, 7962624, 11239424, 20480000, 23887872, 71663616, 78675968, 
	102400000, 214990848, 512000000, 550731776, 644972544, 1934917632

Allocating group tables: done                            
Writing inode tables: done                            
Creating journal (262144 blocks): done
Writing superblocks and filesystem accounting information: done       

btullis@an-worker1235:~$ sudo tune2fs -m 0 /dev/sde1
tune2fs 1.46.2 (28-Feb-2021)
Setting reserved blocks percentage to 0% (0 blocks)
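
The four commands above could be wrapped so that they are printed for review before anything is written to the disk. This is only a sketch: `prepare_data_disk` and the `DRY_RUN` variable are hypothetical, it must run as root when not in dry-run mode, and the `${dev}1` partition name assumes an sdX-style device.

```shell
# Print (default) or execute the partition/format steps for one data disk.
# $1 = whole-disk device, e.g. /dev/sde; $2 = filesystem label.
# Set DRY_RUN=0 to actually run the commands; each step must succeed
# before the next one starts.
prepare_data_disk() {
    dev=$1
    label=$2
    run() { if [ "${DRY_RUN:-1}" = 1 ]; then echo "$*"; else "$@"; fi; }
    run parted "$dev" --script mklabel gpt &&
    run parted "$dev" --script mkpart primary ext4 0% 100% &&
    run mkfs.ext4 -L "$label" "${dev}1" &&
    run tune2fs -m 0 "${dev}1"
}
```

Calling `prepare_data_disk /dev/sde hadoop-e` prints the same commands used above; running it again with `DRY_RUN=0` would execute them.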

Mounted the volume.

btullis@an-worker1235:~$ sudo mount -a
btullis@an-worker1235:~$ lsblk
NAME                               MAJ:MIN RM   SIZE RO TYPE MOUNTPOINT
sda                                  8:0    0 446.6G  0 disk 
├─sda1                               8:1    0   953M  0 part /boot
├─sda2                               8:2    0     1K  0 part 
└─sda5                               8:5    0 445.7G  0 part 
  ├─an--worker1235--vg-swap        254:0    0   9.3G  0 lvm  [SWAP]
  ├─an--worker1235--vg-root        254:1    0  55.9G  0 lvm  /
  └─an--worker1235--vg-journalnode 254:2    0    10G  0 lvm  /var/lib/hadoop/journal
sdb                                  8:16   0   7.3T  0 disk 
└─sdb1                               8:17   0   7.3T  0 part /var/lib/hadoop/data/b
sdc                                  8:32   0   7.3T  0 disk 
└─sdc1                               8:33   0   7.3T  0 part /var/lib/hadoop/data/c
sdd                                  8:48   0   7.3T  0 disk 
└─sdd1                               8:49   0   7.3T  0 part /var/lib/hadoop/data/d
sde                                  8:64   0   7.3T  0 disk 
└─sde1                               8:65   0   7.3T  0 part /var/lib/hadoop/data/e
sdf                                  8:80   0   7.3T  0 disk 
└─sdf1                               8:81   0   7.3T  0 part /var/lib/hadoop/data/f
sdg                                  8:96   0   7.3T  0 disk 
└─sdg1                               8:97   0   7.3T  0 part /var/lib/hadoop/data/g
sdh                                  8:112  0   7.3T  0 disk 
└─sdh1                               8:113  0   7.3T  0 part /var/lib/hadoop/data/h
sdi                                  8:128  0   7.3T  0 disk 
└─sdi1                               8:129  0   7.3T  0 part /var/lib/hadoop/data/i
sdj                                  8:144  0   7.3T  0 disk 
└─sdj1                               8:145  0   7.3T  0 part /var/lib/hadoop/data/j
sdk                                  8:160  0   7.3T  0 disk 
└─sdk1                               8:161  0   7.3T  0 part /var/lib/hadoop/data/k
sdl                                  8:176  0   7.3T  0 disk 
└─sdl1                               8:177  0   7.3T  0 part /var/lib/hadoop/data/l
sdm                                  8:192  0   7.3T  0 disk 
└─sdm1                               8:193  0   7.3T  0 part /var/lib/hadoop/data/m
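
To confirm the result without scanning the whole listing, the mount point of a single partition can be pulled out of the lsblk output. A minimal sketch, assuming the default lsblk columns shown above; `mountpoint_of` is a hypothetical name.

```shell
# Print the mount point of one partition from default `lsblk` output.
# $1 = kernel name of the partition, e.g. sde1; reads lsblk text on stdin.
mountpoint_of() {
    awk -v name="$1" '{
        n = $1
        sub(/^[^A-Za-z0-9]+/, "", n)   # strip the tree-drawing prefix
        if (n == name && $6 == "part") print $7
    }'
}
```

For example, `lsblk | mountpoint_of sde1` should print /var/lib/hadoop/data/e on this host.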