
BBU alarms flapping for analytics1038
Closed, Resolved (Public)

Description

analytics1038 seems to have an issue with the BBU, alarms keep flapping:

08:38  <icinga-wm> PROBLEM - MegaRAID on analytics1038 is CRITICAL: CRITICAL: 13 LD(s) must have write cache policy WriteBack, currently using: WriteThrough, etc..
08:48  <icinga-wm> RECOVERY - MegaRAID on analytics1038 is OK: OK: optimal, 13 logical, 14 physical, WriteBack policy
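To put a number on the flapping, the two relayed alerts above can be paired up and the time spent in CRITICAL measured. This is a minimal sketch: the regular expression is inferred only from the two lines quoted here, and `parse_alert` and `flap_durations` are hypothetical helper names, not part of Icinga or any existing tooling.

```python
import re
from datetime import datetime, timedelta

# Format inferred from the two IRC-relayed lines quoted above (assumption).
LINE_RE = re.compile(
    r"^(?P<time>\d{2}:\d{2})\s+<icinga-wm>\s+"
    r"(?P<kind>PROBLEM|RECOVERY) - (?P<service>.+?) on (?P<host>\S+) is (?P<state>\w+):"
)

def parse_alert(line):
    """Return (time, kind, service, host, state), or None if the line does not match."""
    m = LINE_RE.match(line)
    if m is None:
        return None
    t = datetime.strptime(m.group("time"), "%H:%M")
    return t, m.group("kind"), m.group("service"), m.group("host"), m.group("state")

def flap_durations(lines):
    """Pair each PROBLEM with the next RECOVERY for the same host and service.

    Timestamps carry no date, so a flap that crosses midnight is not handled.
    """
    open_problems = {}
    durations = []
    for line in lines:
        parsed = parse_alert(line)
        if parsed is None:
            continue
        t, kind, service, host, _state = parsed
        key = (host, service)
        if kind == "PROBLEM":
            # Keep the earliest unresolved PROBLEM for this host/service.
            open_problems.setdefault(key, t)
        elif key in open_problems:
            durations.append((key, t - open_problems.pop(key)))
    return durations

log = [
    "08:38  <icinga-wm> PROBLEM - MegaRAID on analytics1038 is CRITICAL: CRITICAL: 13 LD(s) must have write cache policy WriteBack, currently using: WriteThrough, etc..",
    "08:48  <icinga-wm> RECOVERY - MegaRAID on analytics1038 is OK: OK: optimal, 13 logical, 14 physical, WriteBack policy",
]
for (host, service), delta in flap_durations(log):
    print(f"{service} on {host} was critical for {delta}")
```

For the pair above this reports a 10-minute CRITICAL window (08:38 to 08:48).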

I forced a manual re-learn with sudo megacli -AdpBbuCmd -BbuLearn -aALL -NoLog but it doesn't seem to have reached the desired result:

elukey@analytics1038:~$ sudo megacli -AdpBbuCmd -GetBbuStatus -aALL

BBU status for Adapter: 0

BatteryType: BBU
Voltage: 3580 mV
Current: 0 mA
Temperature: 61 C
Battery State: Degraded(Need Attention)
		A manual learn is required.
BBU Firmware Status:

  Charging Status              : None
  Voltage                                 : OK
  Temperature                             : OK
  Learn Cycle Requested	                  : Yes
  Learn Cycle Active                      : No
  Learn Cycle Status                      : OK
  Learn Cycle Timeout                     : No
  I2c Errors Detected                     : No
  Battery Pack Missing                    : No
  Battery Replacement required            : No
  Remaining Capacity Low                  : Yes
  Periodic Learn Required                 : No
  Transparent Learn                       : No
  No space to cache offload               : No
  Pack is about to fail & should be replaced : No
  Cache Offload premium feature required  : No
  Module microcode update required        : No

BBU GasGauge Status: 0x0428
Relative State of Charge: 15 %
Charger Status: Unknown
Remaining Capacity: 79 mAh
Full Charge Capacity: 539 mAh
isSOHGood: Yes

The host is out of warranty (OOW), but if we have a spare BBU in the DC we might attempt a swap.
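The status dump above can also be checked mechanically. The following is a rough sketch, assuming the `key: value` layout shown in the output; `parse_bbu_status` and `bbu_problems` are hypothetical helpers, and the list of flags treated as problems is an illustrative choice, not an official health definition.

```python
def parse_bbu_status(text):
    """Parse 'key : value' lines from -GetBbuStatus output into a dict.

    Lines without a value (section headers, blank lines) are skipped.
    """
    fields = {}
    for raw in text.splitlines():
        key, sep, value = raw.partition(":")
        if sep and value.strip():
            fields[key.strip()] = value.strip()
    return fields

# Flags treated as problems when set to "Yes" (illustrative selection).
PROBLEM_FLAGS = (
    "Battery Pack Missing",
    "Battery Replacement required",
    "Remaining Capacity Low",
    "Pack is about to fail & should be replaced",
)

def bbu_problems(fields):
    """Return human-readable findings for a parsed status dict."""
    problems = []
    state = fields.get("Battery State")
    if state != "Optimal":
        problems.append(f"Battery State: {state}")
    for flag in PROBLEM_FLAGS:
        if fields.get(flag) == "Yes":
            problems.append(f"{flag}: Yes")
    return problems

# Excerpt of the output pasted above.
SAMPLE = """
Battery State: Degraded(Need Attention)
BBU Firmware Status:

  Charging Status              : None
  Battery Pack Missing                    : No
  Battery Replacement required            : No
  Remaining Capacity Low                  : Yes
  Pack is about to fail & should be replaced : No
Relative State of Charge: 15 %
"""

for finding in bbu_problems(parse_bbu_status(SAMPLE)):
    print(finding)
```

Note that in this dump `Battery Replacement required` is `No` even though the battery state is degraded, so a check that looks only at that one flag would miss the problem.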

Event Timeline

elukey created this task. Jan 21 2018, 8:14 AM
Restricted Application added a project: Operations. Jan 21 2018, 8:14 AM
Restricted Application added a subscriber: Aklapper.

This is an older R720xd, and uses an older H710 controller.

While @Cmjohnson can check for a spare when back onsite, there is a good chance we don't have any. If there is a spare, it's going to be another out-of-warranty part, with no promise that it'll work.

Ideally it can swap in with no reimage, but we'll see.

Is there any current roadmap to replace this system?

Cmjohnson moved this task from Backlog to Up next on the ops-eqiad board. Jan 26 2018, 7:10 PM
Dzahn triaged this task as Normal priority. Feb 1 2018, 9:45 PM
elukey added a comment. Feb 2 2018, 3:15 PM

> This is an older R720xd, and uses an older H710 controller.
> While @Cmjohnson can check for a spare when back onsite, there is a good chance we don't have any. If there is a spare, it's going to be another out-of-warranty part, with no promise that it'll work.

Yep this is completely fine!

> Is there any current roadmap to replace this system?

The hosts have a suggested replacement date in 2019, and we were not planning to swap them sooner, but if necessary we can talk about it!

Can this be done around 1500UTC 6 Feb? I will be swapping out another bbu at the same time.

elukey added a comment. Feb 2 2018, 5:36 PM

> Can this be done around 1500UTC 6 Feb? I will be swapping out another bbu at the same time.

Fine by me! We have a big Hadoop maintenance in the morning, but it should be done by 15:00 (if not, I'll ask you to reschedule, but that shouldn't happen). Thanks Chris!

Mentioned in SAL (#wikimedia-operations) [2018-02-06T15:36:46Z] <elukey> drain + shutdown of analytics1038 to replace faulty BBU - T185409

elukey added a comment. Feb 6 2018, 4:24 PM

Much better now!

elukey@analytics1038:~$ sudo megacli -AdpBbuCmd -GetBbuCapacityInfo -aAll


BBU Capacity Info for Adapter: 0

  Relative State of Charge: 81 %
  Absolute State of charge: 0 %
  Remaining Capacity: 412 mAh
  Full Charge Capacity: 510 mAh
  Run time to empty: Battery is not being charged.
  Average time to empty: 33 Min.
  Estimated Time to full recharge: 51 Min.
  Cycle Count: 0
Max Error = 0 %
Remaining Capacity Alarm = 0 mAh
Remining Time Alarm = 0 Min

Exit Code: 0x00
elukey@analytics1038:~$ sudo megacli -AdpBbuCmd -GetBbuStatus -aALL

BBU status for Adapter: 0

BatteryType: BBU
Voltage: 3966 mV
Current: 126 mA
Temperature: 43 C
Battery State: Optimal
BBU Firmware Status:

  Charging Status              : Charging
  Voltage                                 : OK
  Temperature                             : OK
  Learn Cycle Requested	                  : Yes
  Learn Cycle Active                      : No
  Learn Cycle Status                      : OK
  Learn Cycle Timeout                     : No
  I2c Errors Detected                     : No
  Battery Pack Missing                    : No
  Battery Replacement required            : No
  Remaining Capacity Low                  : No
  Periodic Learn Required                 : No
  Transparent Learn                       : Yes
  No space to cache offload               : No
  Pack is about to fail & should be replaced : No
  Cache Offload premium feature required  : No
  Module microcode update required        : No

BBU GasGauge Status: 0x0128
Relative State of Charge: 81 %
Charger Status: In Progress
Remaining Capacity: 412 mAh
Full Charge Capacity: 510 mAh
isSOHGood: Yes

Exit Code: 0x00
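As a sanity check on the two sets of readings, the reported Relative State of Charge can be compared with the ratio of Remaining Capacity to Full Charge Capacity. This is a small sketch under the assumption that the controller derives the percentage from that ratio and rounds to the nearest integer; `relative_charge_percent` is a hypothetical helper.

```python
def relative_charge_percent(remaining_mah, full_mah):
    """Remaining capacity as a rounded percentage of full charge capacity."""
    return round(100 * remaining_mah / full_mah)

# (remaining mAh, full charge mAh, reported %) taken from the two outputs in this task.
readings = {
    "before swap": (79, 539, 15),
    "after swap": (412, 510, 81),
}

for label, (remaining, full, reported) in readings.items():
    computed = relative_charge_percent(remaining, full)
    print(f"{label}: {remaining}/{full} mAh = {computed} % (reported {reported} %)")
```

Both computed values match what the controller reported: 15 % for the old battery and 81 % for the replacement.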
elukey closed this task as Resolved. Feb 6 2018, 4:24 PM
elukey claimed this task.