MegaRAID error on an-worker1088
Closed, Resolved (Public)

Description

We have another MegaRAID failure on one of these hosts. It's in the same batch as those mentioned in T318659: Multiple RAID battery failures on hadoop worker hosts.

CRITICAL: 13 LD(s) must have write cache policy WriteBack, currently using: WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough

We can probably just change the battery, but I will investigate first.
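
For anyone scripting around this alert, here is a minimal sketch of splitting the check message above into its parts. The function name and the returned keys are illustrative only and are not taken from the actual Icinga check; the regular expression assumes the message always has the shape quoted in this task.

```python
import re


def parse_cache_policy_alert(message: str) -> dict:
    """Split a write-cache-policy alert line into its components.

    Assumes the format quoted in this task:
    "<LEVEL>: <N> LD(s) must have write cache policy <EXPECTED>,
    currently using: <POLICY>, <POLICY>, ..."
    """
    match = re.match(
        r"(?P<level>\w+): (?P<count>\d+) LD\(s\) must have write cache policy "
        r"(?P<expected>\w+), currently using: (?P<current>.+)",
        message.strip(),
    )
    if match is None:
        raise ValueError("unrecognised alert format")

    # One policy is reported per logical drive, separated by commas.
    current = [policy.strip() for policy in match["current"].split(",")]
    return {
        "level": match["level"],
        "count": int(match["count"]),
        "expected": match["expected"],
        "current": current,
        # Number of logical drives whose policy differs from the expected one.
        "mismatched": sum(1 for policy in current if policy != match["expected"]),
    }
```

Applied to the alert in this task, every one of the 13 logical drives is reported as WriteThrough, so all 13 count as mismatched.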

Event Timeline

Icinga downtime and Alertmanager silence (ID=19671b41-1b94-43a0-9de6-433c868243f3) set by btullis@cumin1001 for 1 day, 0:00:00 on 1 host(s) and their services with reason: Upgrading RAID controller firmware

an-worker1088.eqiad.wmnet

Mentioned in SAL (#wikimedia-analytics) [2023-05-09T12:59:28Z] <btullis> upgrading SAS RAID controller firmware on an-worker1088 for T336077

This was the battery status information from this controller.

btullis@an-worker1088:~$ sudo megacli -AdpBbuCmd -aALL
                                     
BBU status for Adapter: 0

BatteryType: BBU
Battery State: Unknown
  Battery backup charge time : 0 hours

BBU Capacity Info for Adapter: 0

  Relative State of Charge: 12 %
  Absolute State of charge: 0 %
  Remaining Capacity: 82 mAh
  Full Charge Capacity: 696 mAh
  Run time to empty: Battery is not being charged.  
  Average time to empty: 7 Min. 
  Estimated Time to full recharge: Battery is not being charged.  
  Cycle Count: 2
Max Error = 0 %
Remaining Capacity Alarm = 0 mAh
Remining Time Alarm = 0 Min

BBU Design Info for Adapter: 0

  Date of Manufacture: 00/00, 0000
  Design Capacity: 460 mAh
  Design Voltage: 0 mV
  Specification Info: 0
  Serial Number: 0
  Pack Stat Configuration: 0x0000
  Manufacture Name: 0x113
  Firmware Version   : 0.6
  Device Name: 
  Device Chemistry: 
  Battery FRU: N/A
Module Version = 0.6
  Transparent Learn = 1
  App Data = 1

BBU Properties for Adapter: 0

  Auto Learn Period: 90 Days
  Next Learn time: None  Learn Delay Interval:0 Hours
  Auto-Learn Mode: Disabled

Exit Code: 0x00
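
As a rough cross-check of the figures above, the sketch below parses the `key: value` lines from the megacli output and recomputes the charge level as Remaining Capacity divided by Full Charge Capacity. That the controller derives "Relative State of Charge" this way is an assumption, although the numbers reported here are consistent with it. The helper names are hypothetical.

```python
import re


def parse_bbu_fields(output: str) -> dict:
    """Collect "key: value" / "key = value" pairs from megacli BBU output.

    Only the first occurrence of each key is kept, and lines with an
    empty value (for example "Device Name: ") are skipped.
    """
    fields = {}
    for line in output.splitlines():
        match = re.match(r"\s*([^:=]+?)\s*[:=]\s*(.*?)\s*$", line)
        if match and match.group(2):
            fields.setdefault(match.group(1), match.group(2))
    return fields


def mah(value: str) -> int:
    """Turn a string such as "82 mAh" into the integer 82."""
    return int(value.split()[0])


def charge_percent(fields: dict) -> float:
    """Remaining capacity as a percentage of the full charge capacity."""
    return 100 * mah(fields["Remaining Capacity"]) / mah(fields["Full Charge Capacity"])
```

With the values shown above (82 mAh remaining of 696 mAh), this gives a little under 12 %, which matches the reported Relative State of Charge after rounding.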

I've decided to upgrade the firmware first, to see if this causes the battery to start charging.
However, this host is in the batch of servers identified in T318659: Multiple RAID battery failures on hadoop worker hosts, so I would not be at all surprised if the upgraded firmware makes little difference.

This is the relevant command and output:

btullis@an-worker1088:~$ sudo ./SAS-RAID_Firmware_700GG_LN_25.5.9.0001_A17.BIN 
Collecting inventory...
.^C.
Running validation...

PERC H730 Mini Controller 0

The version of this Update Package is newer than the currently installed version.
Software application name: PERC H730 Mini Controller 0 Firmware
Package version: 25.5.9.0001
Installed version: 25.5.5.0005



Continue? Y/N:y
Executing update...
WARNING: DO NOT STOP THIS PROCESS OR INSTALL OTHER PRODUCTS WHILE UPDATE IS IN PROGRESS.
THESE ACTIONS MAY CAUSE YOUR SYSTEM TO BECOME UNSTABLE!
.............................................................................................
The operation was successful.
Would you like to reboot your system now?

Continue? Y/N:

Rebooting now.
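
The update package reported that 25.5.9.0001 is newer than the installed 25.5.5.0005. How the Dell package actually compares versions is not shown here; the following is only one plausible sketch, assuming dot-separated numeric fields compared left to right. Both function names are hypothetical.

```python
def version_tuple(version: str) -> tuple:
    """Convert a dotted version such as "25.5.9.0001" into a tuple of ints."""
    return tuple(int(part) for part in version.split("."))


def is_newer(package: str, installed: str) -> bool:
    """Return True if the package version sorts after the installed one."""
    return version_tuple(package) > version_tuple(installed)
```

Comparing integer tuples rather than raw strings avoids surprises with zero-padded fields such as `0001` and `0005`.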

Mentioned in SAL (#wikimedia-analytics) [2023-05-09T13:02:30Z] <btullis> rebooting an-worker1088 after firmware upgrade for T336077

I've created a child ticket for ops-eqiad to replace the battery for the RAID controller on this host.

BTullis claimed this task.
BTullis added a subscriber: Jclark-ctr.

Hyper-efficient work there from @Jclark-ctr. Many thanks.
Battery replaced and the RAID error has gone. Here's the latest status from the battery.

btullis@an-worker1088:~$ sudo megacli -AdpBbuCmd -aALL
                                     
BBU status for Adapter: 0

BatteryType: BBU
Voltage: 3915 mV
Current: 499 mA
Temperature: 37 C
Battery State: Optimal
BBU Firmware Status:

  Charging Status              : Charging
  Voltage                                 : OK
  Temperature                             : OK
  Learn Cycle Requested	                  : No
  Learn Cycle Active                      : No
  Learn Cycle Status                      : OK
  Learn Cycle Timeout                     : No
  I2c Errors Detected                     : No
  Battery Pack Missing                    : No
  Battery Replacement required            : No
  Remaining Capacity Low                  : No
  Periodic Learn Required                 : No
  Transparent Learn                       : No
  No space to cache offload               : No
  Pack is about to fail & should be replaced : No
  Cache Offload premium feature required  : No
  Module microcode update required        : No

BBU GasGauge Status: 0x0138 
Relative State of Charge: 41 %
Charger Status: In Progress
Remaining Capacity: 242 mAh
Full Charge Capacity: 601 mAh
isSOHGood: Yes
  Battery backup charge time : 0 hours

BBU Capacity Info for Adapter: 0

  Relative State of Charge: 41 %
  Absolute State of charge: 0 %
  Remaining Capacity: 246 mAh
  Full Charge Capacity: 601 mAh
  Run time to empty: Battery is not being charged.  
  Average time to empty: 20 Min. 
  Estimated Time to full recharge: 1 Hour, 25 Min. 
  Cycle Count: 2
Max Error = 0 %
Remaining Capacity Alarm = 0 mAh
Remining Time Alarm = 0 Min

BBU Design Info for Adapter: 0

  Date of Manufacture: 00/00, 0000
  Design Capacity: 460 mAh
  Design Voltage: 0 mV
  Specification Info: 0
  Serial Number: 0
  Pack Stat Configuration: 0x0000
  Manufacture Name: 0x113
  Firmware Version   : 0.6
  Device Name: 
  Device Chemistry: 
  Battery FRU: N/A
Module Version = 0.6
  Transparent Learn = 1
  App Data = 1

BBU Properties for Adapter: 0

  Auto Learn Period: 90 Days
  Next Learn time: None  Learn Delay Interval:0 Hours
  Auto-Learn Mode: Disabled

Exit Code: 0x00
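
A quick way to tell a healthy report like the one above from the earlier failed one is to look at "Battery State" and the replacement-related flags. The sketch below does that against the raw megacli text. The choice of flags and the function name are assumptions for illustration, not the logic of any existing monitoring check.

```python
# Flags from the "BBU Firmware Status" section that suggest the pack
# needs attention when set to "Yes" (an illustrative selection).
REPLACEMENT_FLAGS = (
    "Battery Pack Missing",
    "Battery Replacement required",
    "Pack is about to fail & should be replaced",
)


def bbu_problems(output: str) -> list:
    """Return a list of human-readable problems found in megacli BBU output."""
    fields = {}
    for line in output.splitlines():
        key, sep, value = line.partition(":")
        if sep:
            # Keep the first occurrence of each key.
            fields.setdefault(key.strip(), value.strip())

    problems = []
    state = fields.get("Battery State", "missing")
    if state != "Optimal":
        problems.append(f"Battery State is {state}")
    for flag in REPLACEMENT_FLAGS:
        if fields.get(flag) == "Yes":
            problems.append(f"{flag}: Yes")
    return problems
```

For the output above this returns an empty list, while the pre-replacement output, with "Battery State: Unknown", yields a single problem entry.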

Resolving this ticket.