
Degraded RAID on an-worker1132
Closed, Resolved (Public)

Description

TASK AUTO-GENERATED by Nagios/Icinga RAID event handler

A degraded RAID (megacli) was detected on host an-worker1132. An automatic snapshot of the current RAID status is attached below.

Please sync with the service owner to find the appropriate time window before actually replacing any failed hardware.

CRITICAL: 1 failed LD(s) (Offline)

$ sudo /usr/local/lib/nagios/plugins/get-raid-status-megacli
Failed to execute '['/usr/lib/nagios/plugins/check_nrpe', '-4', '-H', 'an-worker1132', '-c', 'get_raid_status_megacli']': 'utf-8' codec can't decode byte 0x9c in position 1: invalid start byte
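A side note on the decode error above: a byte 0x9c at position 1, right after an ASCII `x` (0x78), is what the start of a zlib stream looks like when it is read as UTF-8. Whether the NRPE command really returned compressed data here is an assumption, not something this task confirms. A minimal Python sketch of a tolerant decoder follows; `decode_plugin_output` is a hypothetical helper, not part of the actual handler.

```python
import zlib


def decode_plugin_output(raw: bytes) -> str:
    """Decode plugin output, tolerating zlib-compressed or non-UTF-8 payloads.

    Hypothetical helper for illustration only.
    """
    try:
        return raw.decode("utf-8")
    except UnicodeDecodeError:
        pass
    try:
        # Assumption: the payload may be a zlib stream (header bytes 0x78 0x9c).
        return zlib.decompress(raw).decode("utf-8", errors="replace")
    except zlib.error:
        # Last resort: keep the readable parts, replace the undecodable bytes.
        return raw.decode("utf-8", errors="replace")


# Reproduce the error text with the two header bytes alone.
try:
    b"x\x9c".decode("utf-8")
    error_text = ""
except UnicodeDecodeError as exc:
    error_text = str(exc)

# Round trip: compressed status text comes back readable.
original = "Adapter 0: BBU Learn Failed\n"
recovered = decode_plugin_output(zlib.compress(original.encode("utf-8")))
print(error_text)
print(recovered, end="")
```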

Event Timeline

Cmjohnson subscribed.

Submitted a ticket with Dell for a new HDD.

Create Dispatch: Success
You have successfully submitted request SR163405094.

@Cmjohnson this node shows strange behaviour on its RAID/disks.

All disks are really slow compared to the ones on other nodes.
Looking into it, the Current Cache Policy is indeed wrong: it is set to WriteThrough (WT) instead of WriteBack (WB) on all disks:

Default Cache Policy: WriteBack, ReadAdaptive, Direct, No Write Cache if Bad BBU
Current Cache Policy: WriteThrough, ReadAheadNone, Direct, No Write Cache if Bad BBU
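To spot this kind of drift without reading the output by eye, the two policy lines can be compared field by field. This is a minimal Python sketch that assumes the text has the same shape as the two lines above; `parse_cache_policies` and `policy_drift` are hypothetical names.

```python
def parse_cache_policies(text):
    """Return {"Default Cache Policy": [...], "Current Cache Policy": [...]}."""
    wanted = ("Default Cache Policy", "Current Cache Policy")
    policies = {}
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        key = key.strip()
        if sep and key in wanted:
            policies[key] = [item.strip() for item in value.split(",")]
    return policies


def policy_drift(policies):
    """List (default, current) pairs that differ, compared position by position."""
    default = policies.get("Default Cache Policy", [])
    current = policies.get("Current Cache Policy", [])
    return [(d, c) for d, c in zip(default, current) if d != c]


sample = """\
Default Cache Policy: WriteBack, ReadAdaptive, Direct, No Write Cache if Bad BBU
Current Cache Policy: WriteThrough, ReadAheadNone, Direct, No Write Cache if Bad BBU
"""

drift = policy_drift(parse_cache_policies(sample))
for default_value, current_value in drift:
    print(f"default={default_value} current={current_value}")
```

On the two lines from this host it reports the write policy (WriteBack vs WriteThrough) and the read policy (ReadAdaptive vs ReadAheadNone) as different.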

I've looked at the BBU, but it seems to be fine:

nfraison@an-worker1132:/var/lib/hadoop/data/l/test$ sudo megacli -AdpBbuCmd -GetBbuStatus -aALL
                                     
BBU status for Adapter: 0

BatteryType: BBU
Voltage: 3857 mV
Current: 0 mA
Temperature: 31 C
Battery State: Optimal
BBU Firmware Status:

  Charging Status              : None
  Voltage                                 : OK
  Temperature                             : OK
  Learn Cycle Requested	                  : No
  Learn Cycle Active                      : No
  Learn Cycle Status                      : OK
  Learn Cycle Timeout                     : No
  I2c Errors Detected                     : No
  Battery Pack Missing                    : No
  Battery Replacement required            : No
  Remaining Capacity Low                  : No
  Periodic Learn Required                 : No
  Transparent Learn                       : No
  No space to cache offload               : No
  Pack is about to fail & should be replaced : No
  Cache Offload premium feature required  : No
  Module microcode update required        : No

BBU GasGauge Status: 0x0138 
Relative State of Charge: 87 %
Charger Status: Complete
Remaining Capacity: 360 mAh
Full Charge Capacity: 417 mAh
isSOHGood: Yes

Exit Code: 0x00
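For reference, the fields in that output that made the BBU look fine can be pulled out with a few lines of Python. This is only a sketch over a subset of the output above; `parse_bbu_status` and `capacity_mah` are hypothetical helpers, and the choice of "healthy" fields is my own assumption rather than an official check. As the learn failure further down shows, a clean status here is not conclusive on its own.

```python
def parse_bbu_status(text):
    """Collect 'key : value' pairs from BBU status text, skipping empty values."""
    fields = {}
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        if sep and value.strip():
            fields[key.strip()] = value.strip()
    return fields


def capacity_mah(value):
    """Turn a string such as '360 mAh' into the integer 360."""
    return int(value.split()[0])


sample = """\
Battery State: Optimal
  Learn Cycle Status                      : OK
  Battery Replacement required            : No
  Pack is about to fail & should be replaced : No
Relative State of Charge: 87 %
Remaining Capacity: 360 mAh
Full Charge Capacity: 417 mAh
isSOHGood: Yes
"""

status = parse_bbu_status(sample)
ratio = capacity_mah(status["Remaining Capacity"]) / capacity_mah(status["Full Charge Capacity"])
# Assumption: these three fields are treated as the "looks healthy" signal.
healthy = (
    status["Battery State"] == "Optimal"
    and status["Battery Replacement required"] == "No"
    and status["isSOHGood"] == "Yes"
)
print(f"healthy={healthy}, remaining/full={ratio:.0%}")
```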

I also tried to force the use of WB, but nothing changed:

sudo megacli -LDSetProp -ForcedWB -Immediate -Lall -aAll

Running a BBU learn cycle failed, so I wonder whether the BBU is still in a bad state:

nfraison@an-worker1132:/var/lib/hadoop/data/l/test$ sudo megacli -AdpBbuCmd -BbuLearn -aALL -NoLog
                                     
Adapter 0: BBU Learn Failed

Exit Code: 0x01
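The two runs above end with `Exit Code: 0x00` and `Exit Code: 0x01`. If these outputs are ever checked from a script, that trailing line can be read back like this. It is a small Python sketch that relies only on the line format shown in this task; `parse_exit_code` is a hypothetical name.

```python
import re

EXIT_CODE_RE = re.compile(r"Exit Code:\s*(0x[0-9a-fA-F]+)")


def parse_exit_code(output):
    """Return the reported exit code as an int, or None if no such line exists."""
    match = EXIT_CODE_RE.search(output)
    return int(match.group(1), 16) if match else None


status_output = "isSOHGood: Yes\n\nExit Code: 0x00\n"
learn_output = "Adapter 0: BBU Learn Failed\n\nExit Code: 0x01\n"

status_code = parse_exit_code(status_output)
learn_code = parse_exit_code(learn_output)
print(status_code, learn_code)
```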

Do you think we can replace that BBU, or do you see another potential issue?

We can replace the BBU. Let's get the disk replaced first and then create a new ticket for the BBU.

The disk has been swapped and is back online. I am resolving this task and creating a new one for the BBU.

Strangely, since the disk was replaced, everything is back to normal:

RECOVERY - MegaRAID on an-worker1132 is OK: OK: optimal, 13 logical, 14 physical, WriteBack policy https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring
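The recovery line carries the state, the logical and physical drive counts, and the write policy, so it can be used to confirm that the controller is back on WriteBack. Below is a Python sketch with a regular expression written against this one line; it assumes other check outputs follow the same layout, which I have not verified.

```python
import re

# Assumption: the check output always has the shape
# "OK: <state>, <n> logical, <m> physical, <policy> policy".
CHECK_RE = re.compile(
    r"OK: (?P<state>\w+), (?P<logical>\d+) logical, "
    r"(?P<physical>\d+) physical, (?P<policy>\w+) policy"
)


def parse_megaraid_check(line):
    """Return the parsed fields as a dict, or None if the line does not match."""
    match = CHECK_RE.search(line)
    if match is None:
        return None
    return {
        "state": match.group("state"),
        "logical": int(match.group("logical")),
        "physical": int(match.group("physical")),
        "policy": match.group("policy"),
    }


line = (
    "RECOVERY - MegaRAID on an-worker1132 is OK: OK: optimal, "
    "13 logical, 14 physical, WriteBack policy"
)
result = parse_megaraid_check(line)
print(result)
```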

Let's not change the BBU. I will do some checks and create the ticket if needed.