helium (bacula) - Device not healthy -SMART-
Closed, Duplicate (Public)

Description

helium, the bacula server, shows up in Icinga with a SMART not healthy alert:

cluster=misc device=megaraid,10 instance=helium:9100 job=node site=eqiad

https://icinga.wikimedia.org/cgi-bin/icinga/extinfo.cgi?type=2&host=helium&service=Device+not+healthy+-SMART-
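For anyone scripting against alerts like the one above, here is a minimal sketch of splitting the label line into key/value pairs and pulling the number out of the `device` label. The function names `parse_labels` and `megaraid_index` are hypothetical, and the assumption that the label always has the form `megaraid,N` rests only on the two alerts shown in this task.

```python
def parse_labels(line):
    """Split a whitespace-separated 'key=value' label line into a dict."""
    # Split on the first '=' only, so values such as 'helium:9100' stay intact.
    return dict(pair.split("=", 1) for pair in line.split())


def megaraid_index(labels):
    """Return N from a 'megaraid,N' device label, or None if it has another shape."""
    kind, _, index = labels.get("device", "").partition(",")
    if kind != "megaraid" or not index.isdigit():
        return None
    return int(index)


alert = "cluster=misc device=megaraid,10 instance=helium:9100 job=node site=eqiad"
labels = parse_labels(alert)
print(labels["instance"], megaraid_index(labels))
```

Whether that number corresponds to a particular enclosure slot is not established anywhere in this task, so treat it as an opaque identifier.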

Event Timeline

also see T189801 and T196478 which are about setting up backup1001 to replace helium

herron triaged this task as High priority. Oct 2 2018, 5:22 PM

also T206004

unfortunately it still says the RAID is "partially degraded" while the Icinga SMART alert has recovered.

@Dzahn the disk was replaced but it's unconfigured good. I have tried to add it back, but with no success. Can you give it a go please?

I don't know how to do that. How did you try it? Are there maybe docs or examples how that is usually done?

I followed http://erikimh.com/megacli-cheatsheet/ to do so

and

megacli -PdReplaceMissing -PhysDrv [15:9] -Array0 -row9 -a0
                                     
Adapter: 0: Failed to replace Missing PD at Array 0, Row 9.

FW error description: 
  The specified physical drive does not have the appropriate attributes to complete the requested command.  

Exit Code: 0x26

Which had me wondering what on earth and then I found https://www.thomas-krenn.com/de/wiki/MegaCLI_Error_Messages which says

0x26 Unable to use SATA(SAS) drive to replace SAS(SATA)

and sure enough

megacli -PDList -aALL | grep 'PD Type'
PD Type: SAS
PD Type: SAS
PD Type: SAS
PD Type: SAS
PD Type: SAS
PD Type: SAS
PD Type: SAS
PD Type: SAS
PD Type: SAS
PD Type: SATA
PD Type: SAS
PD Type: SAS

@Cmjohnson where did that disk come from ?
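The two checks in the comment above (look up the exit code, then look for a drive whose interface type differs from the rest) can be sketched roughly as follows. This is an illustration under stated assumptions, not part of megacli: the helper names are hypothetical, the exit-code table contains only the single code quoted in this thread, and the drive position is simply the zero-based order of the `PD Type` lines, which is not claimed to match an enclosure slot.

```python
import re
from collections import Counter

# Only the one code quoted in this thread; this is not a complete table.
KNOWN_EXIT_CODES = {0x26: "Unable to use SATA(SAS) drive to replace SAS(SATA)"}


def explain_exit_code(output):
    """Return (code, description) for the first 'Exit Code: 0x..' line, or None."""
    match = re.search(r"Exit Code:\s*(0x[0-9A-Fa-f]+)", output)
    if match is None:
        return None
    code = int(match.group(1), 16)
    return code, KNOWN_EXIT_CODES.get(code, "unknown (not in this sketch's table)")


def find_odd_interfaces(pdlist_output):
    """Return (position, type) for drives whose PD Type differs from the majority."""
    types = [
        line.split(":", 1)[1].strip()
        for line in pdlist_output.splitlines()
        if line.strip().startswith("PD Type:")
    ]
    if not types:
        return []
    majority, _ = Counter(types).most_common(1)[0]
    return [(i, t) for i, t in enumerate(types) if t != majority]


error_output = (
    "Adapter: 0: Failed to replace Missing PD at Array 0, Row 9.\n"
    "Exit Code: 0x26\n"
)
# Same sequence as the grep output above: nine SAS, one SATA, two SAS.
pd_types = "\n".join(["PD Type: SAS"] * 9 + ["PD Type: SATA"] + ["PD Type: SAS"] * 2)

print(explain_exit_code(error_output))
print(find_odd_interfaces(pd_types))
```

Running a check like this before attempting `-PdReplaceMissing` would have flagged the SATA spare without needing to decode the firmware error first.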

The disk was a spare... I didn't even look to see that it was a SATA disk.
This server is out of warranty and we'll need to buy 4TB SAS disks.

Maybe it makes sense to prioritize T196478 instead?

That's what we've been doing up to now, more or less. But it doesn't look good timewise either. See T203827 (I'll add it as a blocker on T196478).

I see.. hmm. yea, then we should buy a replacement disk.

@akosiaris I found a spare 4TB SAS disk...replacing it now

And now we got

sudo /usr/local/lib/nagios/plugins/check_raid
OK: optimal, 1 logical, 12 physical
OK
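If the summary line from `check_raid` needs to be consumed by a script, a minimal parsing sketch could look like this. The line format is assumed solely from the single output shown above (status, state, logical count, physical count); other states or controllers may print something different, and `parse_check_raid` is a hypothetical helper name.

```python
import re

# Pattern inferred from the one line shown in this task; treat it as an assumption.
SUMMARY = re.compile(
    r"(?P<status>[A-Z]+): (?P<state>[^,]+), "
    r"(?P<logical>\d+) logical, (?P<physical>\d+) physical$"
)


def parse_check_raid(line):
    """Parse a summary line into a dict, or return None if it does not match."""
    match = SUMMARY.match(line.strip())
    if match is None:
        return None
    result = match.groupdict()
    result["logical"] = int(result["logical"])
    result["physical"] = int(result["physical"])
    return result


print(parse_check_raid("OK: optimal, 1 logical, 12 physical"))
```

Returning `None` for anything unrecognised keeps the sketch from guessing at formats that do not appear in this task.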

Great. Thanks @Cmjohnson

ayounsi added a subscriber: ayounsi.

CRITICAL (for 9d 15h 51m 18s)

cluster=misc device=megaraid,14 instance=helium:9100 job=node site=eqiad

https://icinga.wikimedia.org/cgi-bin/icinga/extinfo.cgi?type=2&host=helium&service=Device+not+healthy+-SMART-