
RAID battery malfunction in an-worker1081
Closed, Resolved (Public)

Description

We are aware of an incident that is affecting the RAID controller on an-worker1081.
The RAID cache battery backup appears to have been fully discharged and is not charging. As a result, the cache has switched from WriteBack to WriteThrough on each of the 13 logical drives.

image.png (36×1 px, 8 KB)

https://icinga.wikimedia.org/cgi-bin/icinga/extinfo.cgi?type=2&host=an-worker1081&service=MegaRAID

The RAID controller reports that the battery doesn't need replacing:

btullis@an-worker1081:~$ sudo megacli -AdpBbuCmd -aALL|grep -i replace
  Battery Replacement required            : No
  Pack is about to fail & should be replaced : No

...but it doesn't seem to be charging either.

The full output from megacli is here:

btullis@an-worker1081:~$ sudo megacli -AdpBbuCmd -aALL

BBU status for Adapter: 0

BatteryType: BBU
Voltage: 3387 mV
Current: 0 mA
Temperature: 49 C
Battery State: Degraded(Need Attention)
                A manual learn is required.
BBU Firmware Status:

  Charging Status              : None
  Voltage                                 : OK
  Temperature                             : OK
  Learn Cycle Requested                   : Yes
  Learn Cycle Active                      : No
  Learn Cycle Status                      : OK
  Learn Cycle Timeout                     : No
  I2c Errors Detected                     : No
  Battery Pack Missing                    : No
  Battery Replacement required            : No
  Remaining Capacity Low                  : Yes
  Periodic Learn Required                 : No
  Transparent Learn                       : Yes
  No space to cache offload               : No
  Pack is about to fail & should be replaced : No
  Cache Offload premium feature required  : No
  Module microcode update required        : No

BBU GasGauge Status: 0x0128
Relative State of Charge: 15 %
Charger Status: Unknown
Remaining Capacity: 91 mAh
Full Charge Capacity: 632 mAh
isSOHGood: Yes
  Battery backup charge time : 0 hours

BBU Capacity Info for Adapter: 0

  Relative State of Charge: 15 %
  Absolute State of charge: 0 %
  Remaining Capacity: 91 mAh
  Full Charge Capacity: 632 mAh
  Run time to empty: Battery is not being charged.
  Average time to empty: 7 Min.
  Estimated Time to full recharge: Battery is not being charged.
  Cycle Count: 2
Max Error = 0 %
Remaining Capacity Alarm = 0 mAh
Remining Time Alarm = 0 Min

BBU Design Info for Adapter: 0

  Date of Manufacture: 00/00, 0000
  Design Capacity: 460 mAh
  Design Voltage: 0 mV
  Specification Info: 0
  Serial Number: 0
  Pack Stat Configuration: 0x0000
  Manufacture Name: 0x113
  Firmware Version   : 0.6
  Device Name:
  Device Chemistry:
  Battery FRU: N/A
Module Version = 0.6
  Transparent Learn = 1
  App Data = 1

BBU Properties for Adapter: 0

  Auto Learn Period: 90 Days
  Next Learn time: None  Learn Delay Interval:0 Hours
  Auto-Learn Mode: Disabled

Exit Code: 0x00
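
For anyone who wants to pick the relevant fields out of a dump like the one above without reading it all, here is a minimal Python sketch. It assumes the output format shown in this task; the names parse_bbu_status and charging_problems are made up for illustration and are not part of megacli, and treating "Charging Status : None" as "not charging" is my interpretation rather than documented behaviour. The sample lines are copied from the output above.

```python
def parse_bbu_status(text):
    """Turn 'key : value' / 'key = value' lines into a dict (first occurrence wins)."""
    fields = {}
    for line in text.splitlines():
        for sep in (":", "="):
            if sep in line:
                key, _, value = line.partition(sep)
                fields.setdefault(key.strip(), value.strip())
                break
    return fields


def charging_problems(fields):
    """Return human-readable warnings based on a few BBU firmware status fields."""
    problems = []
    # Assumption: 'None' here means the charger is idle, i.e. not charging.
    if fields.get("Charging Status") == "None":
        problems.append("battery is not charging")
    if fields.get("Remaining Capacity Low") == "Yes":
        problems.append("remaining capacity is low")
    if (fields.get("Learn Cycle Requested") == "Yes"
            and fields.get("Learn Cycle Active") == "No"):
        problems.append("learn cycle requested but not active")
    return problems


# Excerpt of the megacli output pasted in this task.
SAMPLE = """\
Battery State: Degraded(Need Attention)
  Charging Status              : None
  Learn Cycle Requested                   : Yes
  Learn Cycle Active                      : No
  Battery Replacement required            : No
  Remaining Capacity Low                  : Yes
Relative State of Charge: 15 %
"""

for problem in charging_problems(parse_bbu_status(SAMPLE)):
    print(problem)
```

On a host, the text would presumably come from capturing the output of sudo megacli -AdpBbuCmd -aALL instead of a pasted sample.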

I tried triggering a learn cycle with sudo megacli -AdpBbuCmd -BbuLearn -aAll but it doesn't seem to have worked.

Event Timeline

BTullis triaged this task as Medium priority. May 12 2022, 5:15 PM
BTullis moved this task from Incoming (new tickets) to Ops Week on the Data-Engineering board.
BTullis moved this task from Next Up to In Progress on the Data-Engineering-Kanban board.

A learn cycle has been requested, but it has not started.

btullis@an-worker1081:~$ sudo megacli -AdpBbuCmd -BbuLearn -a0

Adapter 0: BBU Learn Succeeded.

Exit Code: 0x00
btullis@an-worker1081:~$ sudo megacli -AdpBbuCmd -aALL|grep -i learn
                A manual learn is required.
  Learn Cycle Requested                   : Yes
  Learn Cycle Active                      : No
  Learn Cycle Status                      : OK
  Learn Cycle Timeout                     : No
  Periodic Learn Required                 : No
  Transparent Learn                       : Yes
  Transparent Learn = 1
  Auto Learn Period: 90 Days
  Next Learn time: None  Learn Delay Interval:0 Hours
  Auto-Learn Mode: Disabled

I might try powering off the server to reset the RAID controller.

Mentioned in SAL (#wikimedia-operations) [2022-05-16T11:59:28Z] <btullis@cumin1001> START - Cookbook sre.hosts.downtime for 2:00:00 on an-worker1081.eqiad.wmnet with reason: T308267

Mentioned in SAL (#wikimedia-operations) [2022-05-16T11:59:34Z] <btullis@cumin1001> END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on an-worker1081.eqiad.wmnet with reason: T308267

I shut down the host to cycle the power on the RAID controller card.
Now booting the server again.

The following message was displayed on boot.

image.png (466×693 px, 50 KB)

However, it did proceed past this message and has now booted.

Even after a power cycle it is still not charging, so I can't think of any course of action other than to seek a replacement.

btullis@an-worker1081:~$ sudo megacli -AdpBbuCmd -aALL

BBU status for Adapter: 0

BatteryType: BBU
Voltage: 3069 mV
Current: 0 mA
Temperature: 52 C
Battery State: Degraded(Need Attention)
                A manual learn is required.
BBU Firmware Status:

  Charging Status              : None
  Voltage                                 : OK
  Temperature                             : OK
  Learn Cycle Requested                   : Yes
  Learn Cycle Active                      : No
  Learn Cycle Status                      : OK
  Learn Cycle Timeout                     : No
  I2c Errors Detected                     : No
  Battery Pack Missing                    : No
  Battery Replacement required            : No
  Remaining Capacity Low                  : Yes
  Periodic Learn Required                 : No
  Transparent Learn                       : No
  No space to cache offload               : No
  Pack is about to fail & should be replaced : No
  Cache Offload premium feature required  : No
  Module microcode update required        : No

BBU GasGauge Status: 0x013e
Relative State of Charge: 0 %
Charger Status: Unknown
Remaining Capacity: 0 mAh
Full Charge Capacity: 547 mAh
isSOHGood: Yes
  Battery backup charge time : 0 hours

BBU Capacity Info for Adapter: 0

  Relative State of Charge: 0 %
  Absolute State of charge: 0 %
  Remaining Capacity: 0 mAh
  Full Charge Capacity: 547 mAh
  Run time to empty: Battery is not being charged.
  Average time to empty:
  Estimated Time to full recharge: Battery is not being charged.
  Cycle Count: 2
Max Error = 0 %
Remaining Capacity Alarm = 0 mAh
Remining Time Alarm = 0 Min

BBU Design Info for Adapter: 0

  Date of Manufacture: 00/00, 0000
  Design Capacity: 460 mAh
  Design Voltage: 0 mV
  Specification Info: 0
  Serial Number: 0
  Pack Stat Configuration: 0x0000
  Manufacture Name: 0x113
  Firmware Version   : 0.6
  Device Name:
  Device Chemistry:
  Battery FRU: N/A
Module Version = 0.6
  Transparent Learn = 1
  App Data = 1

BBU Properties for Adapter: 0

  Auto Learn Period: 90 Days
  Next Learn time: None  Learn Delay Interval:0 Hours
  Auto-Learn Mode: Disabled

Exit Code: 0x00
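
To make the change across the power cycle easier to see, here is a small Python sketch that diffs a few numeric readings from the two megacli dumps in this task. The values are copied from the output above; the helper names numeric_fields and deltas are hypothetical, and the choice of which fields to compare is mine.

```python
import re

# Fields to compare; each appears as 'Name: <integer> <unit>' in the dumps.
FIELDS = ["Voltage", "Temperature", "Relative State of Charge",
          "Remaining Capacity", "Full Charge Capacity"]


def numeric_fields(text, names=FIELDS):
    """Return the first integer reported for each named field."""
    values = {}
    for name in names:
        pattern = rf"^[ \t]*{re.escape(name)}[ \t]*:[ \t]*(-?\d+)"
        match = re.search(pattern, text, re.MULTILINE)
        if match:
            values[name] = int(match.group(1))
    return values


def deltas(before, after):
    """Difference (after minus before) for every field present in both snapshots."""
    return {name: after[name] - before[name] for name in before if name in after}


# Readings before the power cycle (first dump in this task).
BEFORE = """\
Voltage: 3387 mV
Temperature: 49 C
Relative State of Charge: 15 %
Remaining Capacity: 91 mAh
Full Charge Capacity: 632 mAh
"""

# Readings after the power cycle (second dump in this task).
AFTER = """\
Voltage: 3069 mV
Temperature: 52 C
Relative State of Charge: 0 %
Remaining Capacity: 0 mAh
Full Charge Capacity: 547 mAh
"""

changes = deltas(numeric_fields(BEFORE), numeric_fields(AFTER))
for name, change in changes.items():
    print(f"{name}: {change:+d}")
```

The voltage and all of the capacity readings dropped between the two snapshots, which is consistent with the battery not taking any charge.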

I have created a sub-task and assigned it to ops-eqiad.
In the meantime I'll move this ticket to paused, and the worker will continue to run with reduced performance.

This incident has now been resolved, with sincere thanks to @wiki_willy and @Jclark-ctr.

image.png (25×1 px, 6 KB)