
Possibly faulty BBU on analytics1029
Closed, Resolved · Public

Description

Alarms are flapping for analytics1029 because the RAID controller's write policy is switching from WriteBack to WriteThrough.

elukey@analytics1029:~$ sudo megacli -AdpBbuCmd  -a0

BBU status for Adapter: 0

BatteryType: BBU
Battery State: Unknown
  Battery backup charge time : 0 hours

BBU Capacity Info for Adapter: 0

  Relative State of Charge: 14 %
  Absolute State of charge: 0 %
  Remaining Capacity: 81 mAh
  Full Charge Capacity: 584 mAh
  Run time to empty: Battery is not being charged.
  Average time to empty: 6 Min.
  Estimated Time to full recharge: Battery is not being charged.
  Cycle Count: 5
Max Error = 0 %
Remaining Capacity Alarm = 0 mAh
Remining Time Alarm = 0 Min

BBU Design Info for Adapter: 0

  Date of Manufacture: 07/18, 2011
  Design Capacity: 90 mAh
  Design Voltage: 0 mV
  Specification Info: 0
  Serial Number: 0
  Pack Stat Configuration: 0x0000
  Manufacture Name:
  Firmware Version   : 0148 03
  Device Name:
  Device Chemistry:
  Battery FRU: N/A
Module Version = 0148 03
  Transparent Learn = 1
  App Data = 1

BBU Properties for Adapter: 0

  Auto Learn Period: 90 Days
  Next Learn time: None  Learn Delay Interval:0 Hours
  Auto-Learn Mode: Disabled

Exit Code: 0x00
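
The readings above could also be checked by a script. What follows is a minimal Python sketch, not part of the original report, that parses the `Key: value` lines of the BBU output and flags a battery whose relative state of charge is low while it is not being charged. The function names (`parse_bbu_fields`, `battery_needs_attention`) and the 20 % threshold are assumptions chosen for illustration; the sample text is an excerpt of the output pasted above.

```python
import re

# Excerpt of the megacli output shown in the task description.
SAMPLE = """
BatteryType: BBU
Battery State: Unknown
  Relative State of Charge: 14 %
  Remaining Capacity: 81 mAh
  Estimated Time to full recharge: Battery is not being charged.
Max Error = 0 %
"""

def parse_bbu_fields(output: str) -> dict:
    """Collect 'Key: value' and 'Key = value' pairs; the first occurrence wins."""
    fields = {}
    for line in output.splitlines():
        match = re.match(r"^\s*([^:=]+?)\s*[:=]\s*(.*?)\s*$", line)
        if match and match.group(2):  # skip lines with an empty value
            fields.setdefault(match.group(1), match.group(2))
    return fields

def battery_needs_attention(fields: dict, min_charge_pct: int = 20) -> bool:
    """True when charge is below the (assumed) threshold and nothing is charging."""
    charge = int(fields["Relative State of Charge"].rstrip("% "))
    not_charging = "not being charged" in fields.get(
        "Estimated Time to full recharge", ""
    )
    return charge < min_charge_pct and not_charging

print(battery_needs_attention(parse_bbu_fields(SAMPLE)))  # True
```

With the values from this host (14 % charge, not charging) the check returns `True`. A real monitoring check would likely look at more fields than these two.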

Event Timeline

elukey created this task. Oct 21 2017, 4:05 PM
Restricted Application added a subscriber: Aklapper. Oct 21 2017, 4:05 PM

Tried sudo megacli -AdpBbuCmd -BbuLearn -aALL, but the battery state still seems to be unknown and the battery is not charging :(

@Cmjohnson hi! I think that we might need a new battery...

fdans moved this task from Incoming to Radar on the Analytics board. Oct 23 2017, 3:38 PM
elukey triaged this task as Medium priority. Oct 24 2017, 9:40 AM
elukey added a project: User-Elukey.

Tried to force a learn cycle again, not much joy:

elukey@analytics1029:~$ sudo megacli -AdpBbuCmd  -a0

BBU status for Adapter: 0

BatteryType: BBU
Voltage: 3506 mV
Current: 0 mA
Temperature: 60 C
Battery State: Degraded(Need Attention)
		A manual learn is required.
BBU Firmware Status:

  Charging Status              : None
  Voltage                                 : OK
  Temperature                             : OK
  Learn Cycle Requested	                  : Yes
  Learn Cycle Active                      : No
  Learn Cycle Status                      : OK
  Learn Cycle Timeout                     : No
  I2c Errors Detected                     : No
  Battery Pack Missing                    : No
  Battery Replacement required            : No
  Remaining Capacity Low                  : Yes
  Periodic Learn Required                 : No
  Transparent Learn                       : No
  No space to cache offload               : No
  Pack is about to fail & should be replaced : No
  Cache Offload premium feature required  : No
  Module microcode update required        : No

BBU GasGauge Status: 0x0428
Relative State of Charge: 5 %
Charger Status: Unknown
Remaining Capacity: 28 mAh
Full Charge Capacity: 584 mAh
isSOHGood: Yes
  Battery backup charge time : 0 hours

BBU Capacity Info for Adapter: 0

  Relative State of Charge: 5 %
  Absolute State of charge: 0 %
  Remaining Capacity: 28 mAh
  Full Charge Capacity: 584 mAh
  Run time to empty: Battery is not being charged.
  Average time to empty: 2 Min.
  Estimated Time to full recharge: Battery is not being charged.
  Cycle Count: 5
Max Error = 0 %
Remaining Capacity Alarm = 0 mAh
Remining Time Alarm = 0 Min

BBU Design Info for Adapter: 0

  Date of Manufacture: 07/18, 2011
  Design Capacity: 90 mAh
  Design Voltage: 0 mV
  Specification Info: 0
  Serial Number: 0
  Pack Stat Configuration: 0x0000
  Manufacture Name:
  Firmware Version   : 0148 03
  Device Name:
  Device Chemistry:
  Battery FRU: N/A
Module Version = 0148 03
  Transparent Learn = 1
  App Data = 1

BBU Properties for Adapter: 0

  Auto Learn Period: 90 Days
  Next Learn time: None  Learn Delay Interval:0 Hours
  Auto-Learn Mode: Disabled

Exit Code: 0x00
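
To make the relevant lines of the "BBU Firmware Status" section easier to spot, here is a minimal Python sketch (not part of the original comment) that lists which status flags are set to `Yes`. The set of flags treated as worrying (`WATCHED_FLAGS`) is an assumption made for illustration, not an official list; the sample text is an excerpt of the output above.

```python
# Flags assumed (for illustration only) to indicate a battery problem when "Yes".
WATCHED_FLAGS = (
    "Learn Cycle Timeout",
    "I2c Errors Detected",
    "Battery Pack Missing",
    "Battery Replacement required",
    "Remaining Capacity Low",
    "Pack is about to fail & should be replaced",
)

# Excerpt of the firmware status section pasted in this comment.
STATUS_SAMPLE = """
  Learn Cycle Requested                   : Yes
  Learn Cycle Active                      : No
  Battery Pack Missing                    : No
  Battery Replacement required            : No
  Remaining Capacity Low                  : Yes
  Pack is about to fail & should be replaced : No
"""

def raised_flags(output: str) -> list:
    """Return the watched flags whose value is 'Yes', in order of appearance."""
    raised = []
    for line in output.splitlines():
        # Split on the last colon so flag names keep any earlier punctuation.
        name, sep, value = line.rpartition(":")
        if sep and name.strip() in WATCHED_FLAGS and value.strip() == "Yes":
            raised.append(name.strip())
    return raised

print(raised_flags(STATUS_SAMPLE))  # ['Remaining Capacity Low']
```

On this host only `Remaining Capacity Low` is raised among the watched flags. `Learn Cycle Requested: Yes` is deliberately not in the watched set, since it reflects the manual learn request rather than a fault.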
elukey moved this task from Backlog to Stalled on the User-Elukey board. Oct 30 2017, 9:34 AM
Cmjohnson assigned this task to RobH. Oct 30 2017, 2:18 PM
Cmjohnson added a subscriber: RobH.

This server has been out of warranty for 6 months. Assigning to @RobH to determine whether we should order a new one.

Cmjohnson moved this task from Backlog to Blocked on the ops-eqiad board. Oct 30 2017, 2:19 PM
RobH added a comment. Edited Nov 9 2017, 4:45 PM

So the decision on whether to order new hardware to replace the failed hardware will also depend on budgeting and on whether Analytics can run without this host or requires it.

@elukey: We should chat via IRC about this. We'll need to sync up on the required hardware specifications for a replacement and chat about budgets with @faidon as well.

RobH added a subscriber: faidon. Nov 9 2017, 4:48 PM
RobH added a comment. Nov 9 2017, 5:24 PM

Discussed this some: Chris is going to pull a BBU out of a decom system to replace the defective one. Analytics may also have to start planning for the replacement of this batch of systems, since they are on their way to aging out.

These are larger 2U systems (R720xd), usually picked due to the number of disks needed. We should review the needs for this cluster in terms of overall capacity, I/O, and spindle count, and re-evaluate this specification. (The db systems used to be similar to these, but are now 1U systems with 10 SSDs.)

Mentioned in SAL (#wikimedia-operations) [2017-11-13T18:20:40Z] <elukey> drain + shutdown analytics1029 as prep step to replace the BBU - T178742

Chris swapped the battery (two times) and it seems that the second one is ok! Will keep an eye on it and close the task tomorrow if everything is good.

elukey closed this task as Resolved. Nov 14 2017, 8:36 AM

Everything seems good, removed downtime for the host. Thanks Chris!

Aklapper edited projects, added Analytics-Radar; removed Analytics. Jun 10 2020, 6:44 AM