
Degraded RAID on db2194
Closed, Resolved, Public

Description

TASK AUTO-GENERATED by Nagios/Icinga RAID event handler

A degraded RAID (perccli) was detected on host db2194. An automatic snapshot of the current RAID status is attached below.

Please sync with the service owner to find the appropriate time window before actually replacing any failed hardware.

communication: 0 OK

$ sudo /usr/local/lib/nagios/plugins/get-raid-status-perccli
Failed to execute '['/usr/lib/nagios/plugins/check_nrpe', '-4', '-H', 'db2194', '-c', 'get_raid_status_perccli']': RETCODE: 2
STDOUT:
communication: 0 OK | controller: 1 Needs Attention | physical_disk: 0 OK | virtual_disk: 1 Dgrd | bbu: 0 OK | enclosure: 0 OK | CLI Version = 007.1910.0000.0000 Oct 08, 2021
Operating system = Linux 6.1.0-17-amd64
Controller = 0
Status = Success
Description = Show Drive Group Succeeded


TOPOLOGY :
========

----------------------------------------------------------------------------
DG Arr Row EID:Slot DID Type   State BT     Size PDC  PI SED DS3  FSpace TR 
----------------------------------------------------------------------------
 0 -   -   -        -   RAID10 Dgrd  N  8.729 TB dflt N  N   dflt N      N  
 0 0   -   -        -   RAID1  Dgrd  N  8.729 TB dflt N  N   dflt N      N  
 0 0   0   252:0    9   DRIVE  Onln  N  1.745 TB dflt N  N   dflt -      N  
 0 0   1   252:1    8   DRIVE  Onln  N  1.745 TB dflt N  N   dflt -      N  
 0 0   2   252:2    7   DRIVE  Onln  N  1.745 TB dflt N  N   dflt -      N  
 0 0   3   252:3    6   DRIVE  Onln  N  1.745 TB dflt N  N   dflt -      N  
 0 0   4   252:4   

STDERR:
None

Event Timeline

ABran-WMF triaged this task as Medium priority. Feb 8 2024, 3:31 PM
ABran-WMF added a project: DBA.

Indeed, a disk is reported missing:

arnaudb@db2194:~ $ sudo /usr/local/lib/nagios/plugins/get-raid-status-perccli
communication: 0 OK | controller: 1 Needs Attention | physical_disk: 0 OK | virtual_disk: 1 Dgrd | bbu: 0 OK | enclosure: 0 OK | CLI Version = 007.1910.0000.0000 Oct 08, 2021
Operating system = Linux 6.1.0-17-amd64
Controller = 0
Status = Success
Description = Show Drive Group Succeeded


TOPOLOGY :
========

----------------------------------------------------------------------------
DG Arr Row EID:Slot DID Type   State BT     Size PDC  PI SED DS3  FSpace TR 
----------------------------------------------------------------------------
 0 -   -   -        -   RAID10 Dgrd  N  8.729 TB dflt N  N   dflt N      N  
 0 0   -   -        -   RAID1  Dgrd  N  8.729 TB dflt N  N   dflt N      N  
 0 0   0   252:0    9   DRIVE  Onln  N  1.745 TB dflt N  N   dflt -      N  
 0 0   1   252:1    8   DRIVE  Onln  N  1.745 TB dflt N  N   dflt -      N  
 0 0   2   252:2    7   DRIVE  Onln  N  1.745 TB dflt N  N   dflt -      N  
 0 0   3   252:3    6   DRIVE  Onln  N  1.745 TB dflt N  N   dflt -      N  
 0 0   4   252:4    5   DRIVE  Onln  N  1.745 TB dflt N  N   dflt -      N  
 0 0   5   -        -   DRIVE  Msng  -  1.745 TB -    -  -   -    -      N  
 0 0   6   252:6    3   DRIVE  Onln  N  1.745 TB dflt N  N   dflt -      N  
 0 0   7   252:7    1   DRIVE  Onln  N  1.745 TB dflt N  N   dflt -      N  
 0 0   8   252:8    0   DRIVE  Onln  N  1.745 TB dflt N  N   dflt -      N  
 0 0   9   252:9    2   DRIVE  Onln  N  1.745 TB dflt N  N   dflt -      N  
----------------------------------------------------------------------------

DG=Disk Group Index|Arr=Array Index|Row=Row Index|EID=Enclosure Device ID
DID=Device ID|Type=Drive Type|Onln=Online|Rbld=Rebuild|Optl=Optimal|Dgrd=Degraded
Pdgd=Partially degraded|Offln=Offline|BT=Background Task Active
PDC=PD Cache|PI=Protection Info|SED=Self Encrypting Drive|Frgn=Foreign
DS3=Dimmer Switch 3|dflt=Default|Msng=Missing|FSpace=Free Space Present
TR=Transport Ready
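
For reference, a minimal Python sketch of how the missing drive could be picked out of the topology table automatically. This is not part of the Nagios plugin; the function name `drives_not_online` and the assumption that DRIVE rows always follow the column order `DG Arr Row EID:Slot DID Type State ...` are mine, based only on the output above.

```python
# Sample rows copied from the perccli topology output above
# (trailing whitespace trimmed).
TOPOLOGY = """\
 0 -   -   -        -   RAID10 Dgrd  N  8.729 TB dflt N  N   dflt N      N
 0 0   4   252:4    5   DRIVE  Onln  N  1.745 TB dflt N  N   dflt -      N
 0 0   5   -        -   DRIVE  Msng  -  1.745 TB -    -  -   -    -      N
 0 0   6   252:6    3   DRIVE  Onln  N  1.745 TB dflt N  N   dflt -      N
"""


def drives_not_online(topology_text):
    """Return (row, eid_slot, state) for every DRIVE row whose state is not Onln."""
    problems = []
    for line in topology_text.splitlines():
        fields = line.split()
        # Assumed layout: DG Arr Row EID:Slot DID Type State ...
        # Only physical-drive rows are checked; RAID10/RAID1 rows are skipped.
        if len(fields) >= 7 and fields[5] == "DRIVE" and fields[6] != "Onln":
            problems.append((int(fields[2]), fields[3], fields[6]))
    return problems


print(drives_not_online(TOPOLOGY))
```

On the sample rows this reports row 5 with state `Msng`, which matches the missing disk seen above.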

The server is less than 3 years old according to Netbox. @wiki_willy, could we please have a replacement? :-)

@ABran-WMF I need to run some troubleshooting steps before Dell will replace the disk. Is it safe for me to power down the server? It shouldn't take long, and I can get the request out today.

As expected, the drive did not come back after Dell's recommended troubleshooting. Created a dispatch: SR184935290. Will notify when the disk is replaced.

Disk has been replaced. It appears in the iDRAC.

Everything's back to normal:

$ sudo /usr/local/lib/nagios/plugins/get-raid-status-perccli
communication: 0 OK | controller: 0 OK | physical_disk: 0 OK | virtual_disk: 0 OK | bbu: 0 OK | enclosure: 0 OK
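
For anyone comparing the before and after status lines by script, here is a rough Python sketch that splits the summary line into per-component states. The function name `parse_summary` is hypothetical, and treating a non-zero leading number as "not OK" is an assumption drawn from the two outputs in this task, not from the plugin's documentation.

```python
def parse_summary(line):
    """Split 'name: CODE TEXT | name: CODE TEXT | ...' into {name: (code, text)}."""
    status = {}
    for part in line.split("|"):
        name, sep, value = part.partition(":")
        if not sep:
            # No colon, e.g. the trailing 'CLI Version = ...' text.
            continue
        code, _, text = value.strip().partition(" ")
        if not code.isdigit():
            continue
        status[name.strip()] = (int(code), text)
    return status


# The two summary lines as they appear in this task.
before = ("communication: 0 OK | controller: 1 Needs Attention | "
          "physical_disk: 0 OK | virtual_disk: 1 Dgrd | bbu: 0 OK | "
          "enclosure: 0 OK | CLI Version = 007.1910.0000.0000 Oct 08, 2021")
after = ("communication: 0 OK | controller: 0 OK | physical_disk: 0 OK | "
         "virtual_disk: 0 OK | bbu: 0 OK | enclosure: 0 OK")

# Assumption: a non-zero code marks a component that needs attention.
for label, line in (("before", before), ("after", after)):
    flagged = {k: v for k, v in parse_summary(line).items() if v[0] != 0}
    print(label, flagged)
```

With the lines from this task, only `controller` and `virtual_disk` are flagged before the replacement, and nothing is flagged afterwards.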