helium (bacula) - Device not healthy -SMART-
Closed, Duplicate (Public)

Description

helium, the bacula server, shows up in Icinga with a SMART not healthy alert:

cluster=misc device=megaraid,10 instance=helium:9100 job=node site=eqiad

https://icinga.wikimedia.org/cgi-bin/icinga/extinfo.cgi?type=2&host=helium&service=Device+not+healthy+-SMART-

Event Timeline

Dzahn created this task. Sep 25 2018, 12:53 AM
Restricted Application added a subscriber: Aklapper. Sep 25 2018, 12:53 AM

also see T189801 and T196478 which are about setting up backup1001 to replace helium

herron triaged this task as High priority. Oct 2 2018, 5:22 PM

Swapped the failed disk

Dzahn reassigned this task from Dzahn to Cmjohnson. Oct 2 2018, 7:48 PM

also T206004

unfortunately it still says the RAID is "partially degraded" while the Icinga SMART alert has recovered.

@Dzahn the disk was replaced but it shows up as "Unconfigured Good". I have tried to add it back, but with no success. Can you give it a go, please?

Dzahn added a comment. Oct 3 2018, 9:13 PM

I don't know how to do that. How did you try it? Are there maybe docs or examples how that is usually done?
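As a starting point for that question, here is a minimal sketch that lists each drive's enclosure, slot, and firmware state, so a drive sitting in "Unconfigured(good)" is easy to spot. The function name `pd_summary` is hypothetical, and the sketch assumes the `megacli -PDList` output carries the usual `Enclosure Device ID:`, `Slot Number:` and `Firmware state:` fields; adjust the patterns if the output on this controller differs.

```shell
# Summarize `megacli -PDList -aALL` output as one line per drive.
# Assumption: each drive block contains "Enclosure Device ID:",
# "Slot Number:" and "Firmware state:" lines, in that order.
pd_summary() {
    awk -F': *' '
        /^Enclosure Device ID/ { enc = $2 }
        /^Slot Number/         { slot = $2 }
        /^Firmware state/      { printf "[%s:%s] %s\n", enc, slot, $2 }
    '
}

# Usage (needs root and a MegaRAID controller):
#   megacli -PDList -aALL | pd_summary
```

The `[enclosure:slot]` pair it prints is in the same form that the `-PhysDrv` argument expects.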

akosiaris reopened this task as Open. Oct 4 2018, 1:40 PM

I followed http://erikimh.com/megacli-cheatsheet/ to do so

and

megacli -PdReplaceMissing -PhysDrv [15:9] -Array0 -row9 -a0
                                     
Adapter: 0: Failed to replace Missing PD at Array 0, Row 9.

FW error description: 
  The specified physical drive does not have the appropriate attributes to complete the requested command.  

Exit Code: 0x26

Which had me wondering what on earth was going on, and then I found https://www.thomas-krenn.com/de/wiki/MegaCLI_Error_Messages which says

0x26 Unable to use SATA(SAS) drive to replace SAS(SATA)

and sure enough

megacli -PDList -aALL | grep 'PD Type'
PD Type: SAS
PD Type: SAS
PD Type: SAS
PD Type: SAS
PD Type: SAS
PD Type: SAS
PD Type: SAS
PD Type: SAS
PD Type: SAS
PD Type: SATA
PD Type: SAS
PD Type: SAS

@Cmjohnson where did that disk come from ?
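The check in the listing above can be automated before a replacement drive is added to the array. The following is a minimal sketch, not an established tool: `pd_type_check` is a hypothetical name, and it assumes the same `Enclosure Device ID:`, `Slot Number:` and `PD Type:` fields appear in the `megacli -PDList` output. It treats the most common PD type as the expected one, so it is only meaningful when one type clearly dominates.

```shell
# Report drives whose "PD Type" differs from the most common type.
# Reads `megacli -PDList -aALL` output on stdin; exit status is 1
# when at least one drive has a different type, 0 otherwise.
pd_type_check() {
    awk -F': *' '
        /^Enclosure Device ID/ { enc = $2 }
        /^Slot Number/         { slot = $2 }
        /^PD Type/ {
            n++
            type[n] = $2
            loc[n] = enc ":" slot
            count[$2]++
        }
        END {
            # Pick the most common type as the expected one.
            for (t in count)
                if (count[t] > best) { best = count[t]; major = t }
            for (i = 1; i <= n; i++)
                if (type[i] != major) {
                    printf "[%s] is %s, expected %s\n", loc[i], type[i], major
                    bad = 1
                }
            exit bad + 0
        }
    '
}

# Usage (needs root and a MegaRAID controller):
#   megacli -PDList -aALL | pd_type_check || echo "mixed SAS/SATA drives"
```

Running it ahead of `-PdReplaceMissing` would surface a SAS/SATA mismatch without having to decode exit code 0x26 afterwards.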

The disk was a spare... I didn't even look to see that it was a SATA disk.
This server is out of warranty and we'll need to buy 4TB SAS disks

Dzahn added a comment. Oct 4 2018, 3:19 PM

Maybe it makes sense to prioritize T196478 instead?

> Maybe it makes sense to prioritize T196478 instead?

That's what we've been doing up to now, more or less. But it doesn't look good timewise either. See T203827 (I'll add it as a blocker on T196478).

Dzahn added a comment. Oct 4 2018, 9:19 PM

I see.. hmm. yea, then we should buy a replacement disk.

@akosiaris I found a spare 4TB SAS disk...replacing it now

akosiaris closed this task as Resolved. Oct 8 2018, 9:05 AM

And now we got

sudo /usr/local/lib/nagios/plugins/check_raid
OK: optimal, 1 logical, 12 physical
OK

Great. Thanks @Cmjohnson

ayounsi reopened this task as Open. Tue, Jul 9, 12:49 AM
ayounsi added a subscriber: ayounsi.
CRITICAL   (for 9d 15h 51m 18s)

cluster=misc device=megaraid,14 instance=helium:9100 job=node site=eqiad

https://icinga.wikimedia.org/cgi-bin/icinga/extinfo.cgi?type=2&host=helium&service=Device+not+healthy+-SMART-