Degraded RAID on an-worker1132
Closed, Resolved, Public

Assigned To

Authored By

	ops-monitoring-bot
	Mar 2 2023, 5:09 AM

Description

TASK AUTO-GENERATED by Nagios/Icinga RAID event handler

A degraded RAID (megacli) was detected on host an-worker1132. An automatic snapshot of the current RAID status is attached below.

Please sync with the service owner to find the appropriate time window before actually replacing any failed hardware.

CRITICAL: 1 failed LD(s) (Offline)

$ sudo /usr/local/lib/nagios/plugins/get-raid-status-megacli
Failed to execute '['/usr/lib/nagios/plugins/check_nrpe', '-4', '-H', 'an-worker1132', '-c', 'get_raid_status_megacli']': 'utf-8' codec can't decode byte 0x9c in position 1: invalid start byte
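
One possible reading of the decode failure above: byte 0x9c at position 1 matches the second byte of a default zlib stream header (0x78 0x9c), so the NRPE output may have been compressed rather than plain text. This is a guess; nothing in this task confirms it. A minimal Python sketch of a tolerant decoder follows, where `decode_plugin_output` is a hypothetical helper and not part of the actual event handler:

```python
import zlib


def decode_plugin_output(raw: bytes) -> str:
    """Decode raw check output, tolerating zlib-compressed or non-UTF-8 bytes.

    Hypothetical helper for illustration only.
    """
    try:
        # Assumption: the payload may be a zlib stream (header 0x78 0x9c).
        raw = zlib.decompress(raw)
    except zlib.error:
        # Not a zlib stream; treat the bytes as plain text.
        pass
    # Replace undecodable bytes instead of raising UnicodeDecodeError.
    return raw.decode("utf-8", errors="replace")
```

With a decoder along these lines, the RAID snapshot would be attached even when the raw bytes are not valid UTF-8.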

Related Objects

Mentioned In: T331543: anworker1132 BBU issue/replacement
T330979: Investigate slownesses on an-worker1132

Event Timeline

ops-monitoring-bot created this task. Mar 2 2023, 5:09 AM

Peachey88 added a project: Data-Engineering. Mar 2 2023, 9:47 AM

Submitted a ticket with Dell for a new HDD.

Create Dispatch: Success
You have successfully submitted request SR163405094.

Cmjohnson moved this task from Backlog to Hardware Failure / Troubleshoot on the ops-eqiad board. Mar 2 2023, 3:53 PM

nfraison mentioned this in T330979: Investigate slownesses on an-worker1132. Mar 2 2023, 3:53 PM

Peachey88 merged a task: T331073: Degraded RAID on an-worker1132. Mar 3 2023, 3:08 AM

Peachey88 merged a task: T331068: Degraded RAID on an-worker1132.

Peachey88 merged a task: T331064: Degraded RAID on an-worker1132.

Peachey88 merged a task: T331059: Degraded RAID on an-worker1132.

@Cmjohnson this node shows strange behaviour on its RAID/disks.

All disks are really slow compared to the ones on other nodes.
On closer inspection, the Current Cache Policy is indeed wrong: it is set to WT (WriteThrough) instead of WB (WriteBack) on all disks:

Default Cache Policy: WriteBack, ReadAdaptive, Direct, No Write Cache if Bad BBU
Current Cache Policy: WriteThrough, ReadAheadNone, Direct, No Write Cache if Bad BBU
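
The mismatch between the two lines above can be checked mechanically. A minimal sketch, assuming the policy lines keep the comma-separated layout shown above; `parse_policy` and `write_cache_degraded` are hypothetical helper names, not part of megacli or the monitoring plugin:

```python
def parse_policy(line: str) -> list:
    """Split a 'Name: a, b, c' policy line into its comma-separated fields."""
    _, _, value = line.partition(":")
    return [part.strip() for part in value.split(",")]


def write_cache_degraded(default_line: str, current_line: str) -> bool:
    """Return True if the default write policy is WriteBack but the current one is not."""
    default_write = parse_policy(default_line)[0]
    current_write = parse_policy(current_line)[0]
    return default_write == "WriteBack" and current_write != "WriteBack"


default = "Default Cache Policy: WriteBack, ReadAdaptive, Direct, No Write Cache if Bad BBU"
current = "Current Cache Policy: WriteThrough, ReadAheadNone, Direct, No Write Cache if Bad BBU"
print(write_cache_degraded(default, current))
```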

I've looked at the BBU, but it seems to be fine:

nfraison@an-worker1132:/var/lib/hadoop/data/l/test$ sudo megacli -AdpBbuCmd -GetBbuStatus -aALL
                                     
BBU status for Adapter: 0

BatteryType: BBU
Voltage: 3857 mV
Current: 0 mA
Temperature: 31 C
Battery State: Optimal
BBU Firmware Status:

  Charging Status              : None
  Voltage                                 : OK
  Temperature                             : OK
  Learn Cycle Requested	                  : No
  Learn Cycle Active                      : No
  Learn Cycle Status                      : OK
  Learn Cycle Timeout                     : No
  I2c Errors Detected                     : No
  Battery Pack Missing                    : No
  Battery Replacement required            : No
  Remaining Capacity Low                  : No
  Periodic Learn Required                 : No
  Transparent Learn                       : No
  No space to cache offload               : No
  Pack is about to fail & should be replaced : No
  Cache Offload premium feature required  : No
  Module microcode update required        : No

BBU GasGauge Status: 0x0138 
Relative State of Charge: 87 %
Charger Status: Complete
Remaining Capacity: 360 mAh
Full Charge Capacity: 417 mAh
isSOHGood: Yes

Exit Code: 0x00
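
For scripted checks, a status dump like the one above can be turned into a dictionary. A rough sketch, assuming every field is a `name : value` pair on its own line; `parse_bbu_status`, `bbu_alarms`, and the `ALARM_FLAGS` selection are illustrative choices, not an official list:

```python
def parse_bbu_status(text: str) -> dict:
    """Parse 'name : value' lines into a dict, skipping headers with no value."""
    fields = {}
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        if sep and value.strip():
            fields[key.strip()] = value.strip()
    return fields


# Assumption: these flags are the ones worth alerting on; adjust as needed.
ALARM_FLAGS = (
    "Battery Pack Missing",
    "Battery Replacement required",
    "Remaining Capacity Low",
    "Pack is about to fail & should be replaced",
)


def bbu_alarms(fields: dict) -> list:
    """Return the names of alarm flags whose value is anything other than 'No'."""
    return [name for name in ALARM_FLAGS if fields.get(name, "No") != "No"]
```

Applied to the output above, none of the selected flags is raised, which matches the observation that the BBU looks fine from its status alone.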

I also tried to force WB, but nothing changed:

sudo megacli -LDSetProp -ForcedWB -Immediate -Lall -aAll

Running a BBU learn cycle failed, so I wonder whether the BBU is in a bad state after all:

nfraison@an-worker1132:/var/lib/hadoop/data/l/test$ sudo megacli -AdpBbuCmd -BbuLearn -aALL -NoLog
                                     
Adapter 0: BBU Learn Failed

Exit Code: 0x01
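
The two megacli runs above differ in their trailing `Exit Code` line (0x00 for the status query, 0x01 for the failed learn cycle). A small sketch for extracting that value from captured output, assuming the line format shown above; `megacli_exit_code` is a hypothetical helper name:

```python
import re


def megacli_exit_code(output: str):
    """Return the integer from an 'Exit Code: 0x..' line, or None if absent."""
    match = re.search(r"Exit Code:\s*(0x[0-9A-Fa-f]+)", output)
    return int(match.group(1), 16) if match else None


learn_output = "Adapter 0: BBU Learn Failed\n\nExit Code: 0x01\n"
print(megacli_exit_code(learn_output))
```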

Do you think we can replace that BBU, or do you see another potential issue?

RhinosF1 subscribed. Mar 3 2023, 10:36 AM

We can replace the BBU. Let's get the disk replaced first and then create a new ticket for the BBU.

lbowmaker moved this task from Incoming (new tickets) to Ops Week on the Data-Engineering board. Mar 3 2023, 2:55 PM

The disk has been swapped and is back online. I am resolving this task and creating a new one for the BBU.

Strangely, since the disk was replaced, everything is back to normal:

RECOVERY - MegaRAID on an-worker1132 is OK: OK: optimal, 13 logical, 14 physical, WriteBack policy https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring

Let's not change the BBU. I will do some checks and create the ticket if needed.

RhinosF1 mentioned this in T331543: anworker1132 BBU issue/replacement. Mar 8 2023, 4:02 PM
