db1016 m1 master: Possibly faulty BBU
Closed, Declined (Public)

Description

From Icinga:

db1016

MegaRAID
CRITICAL	2017-05-25 20:28:18	0d 0h 37m 17s	3/3	CRITICAL: 1 LD(s) must have write cache policy WriteBack, currently using: WriteThrough
Charging Status              : Discharging

Event Timeline

Marostegui triaged this task as Medium priority. (May 25 2017, 8:40 PM)
Marostegui added a project: ops-eqiad.

I've ack'ed the Icinga alarm with this task.

I've also forced a BBU learn cycle on db1016. It looked good during the cycle: as soon as the battery had some charge, the controller went back to the WriteBack write policy. At the end, however, it gave up and returned to the failed battery state and to WriteThrough.
The charging status of the battery did change from None to Discharging, but the battery no longer has enough capacity, having been manufactured in 2010. So I'd say we'll need to replace it, unless we plan to replace the host very soon.

Before the cycle:

$ sudo megacli -AdpBbuCmd -GetBbuCapacityInfo -aAll


BBU Capacity Info for Adapter: 0

  Relative State of Charge: 31 %
  Absolute State of charge: 3 %
  Remaining Capacity: 56 mAh
  Full Charge Capacity: 183 mAh
  Run time to empty: Battery is not being charged.
  Average time to empty: Battery is not being charged.
  Estimated Time to full recharge: Battery is not being charged.
  Cycle Count: 31
Max Error = 0 %
Remaining Capacity Alarm = 170 mAh
Remining Time Alarm = 10 Min
$ sudo megacli -AdpBbuCmd  -a0

BBU status for Adapter: 0

BatteryType: BBU
Voltage: 4028 mV
Current: 0 mA
Temperature: 32 C
Battery State: Degraded(Need Attention)
		A manual learn is required.
BBU Firmware Status:

  Charging Status              : None
  Voltage                                 : OK
  Temperature                             : OK
  Learn Cycle Requested	                  : Yes
  Learn Cycle Active                      : No
  Learn Cycle Status                      : OK
  Learn Cycle Timeout                     : No
  I2c Errors Detected                     : No
  Battery Pack Missing                    : No
  Battery Replacement required            : No
  Remaining Capacity Low                  : Yes
  Periodic Learn Required                 : No
  Transparent Learn                       : No
  No space to cache offload               : No
  Pack is about to fail & should be replaced : No
  Cache Offload premium feature required  : No
  Module microcode update required        : No


GasGuageStatus:
  Fully Discharged        : No
  Fully Charged           : No
  Discharging             : No
  Initialized             : Yes
  Remaining Time Alarm    : No
  Discharge Terminated    : No
  Over Temperature        : No
  Charging Terminated     : No
  Over Charged            : No
Relative State of Charge: 31 %
Charger Status: In Progress
Remaining Capacity: 56 mAh
Full Charge Capacity: 183 mAh
isSOHGood: Yes
  Battery backup charge time : 0 hours

BBU Capacity Info for Adapter: 0

  Relative State of Charge: 32 %
  Absolute State of charge: 3 %
  Remaining Capacity: 59 mAh
  Full Charge Capacity: 183 mAh
  Run time to empty: Battery is not being charged.
  Average time to empty: Battery is not being charged.
  Estimated Time to full recharge: 37 Min.
  Cycle Count: 31
Max Error = 0 %
Remaining Capacity Alarm = 170 mAh
Remining Time Alarm = 10 Min

BBU Design Info for Adapter: 0

  Date of Manufacture: 11/17, 2010
  Design Capacity: 1700 mAh
  Design Voltage: 3700 mV
  Specification Info: 33
  Serial Number: 5092
  Pack Stat Configuration: 0x0000
  Manufacture Name: SANYO
  Firmware Version   :
  Device Name: DLNU209
  Device Chemistry: LION
  Battery FRU: N/A
  Transparent Learn = 0
  App Data = 0

BBU Properties for Adapter: 0

  Auto Learn Period: 90 Days
  Next Learn time: None  Learn Delay Interval:0 Hours
  Auto-Learn Mode: Warn via Event
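
For context on the numbers above: the reported Full Charge Capacity (183 mAh) is only about 11% of the Design Capacity (1700 mAh). The following is a minimal sketch of that arithmetic, not something that was run on the host. `bbu_health_pct` is a hypothetical helper name, and the sketch assumes the megacli BBU output shown above is piped in on stdin.

```shell
# Hypothetical helper (not part of megacli): reads megacli BBU output on
# stdin and prints Full Charge Capacity as a rounded percentage of
# Design Capacity. If several adapters are listed, the last values win.
bbu_health_pct() {
    awk -F': *' '
        /Full Charge Capacity/ { split($2, a, " "); full = a[1] }
        /Design Capacity/      { split($2, a, " "); design = a[1] }
        END {
            # Fail if no Design Capacity line was seen.
            if (design > 0) printf "%d\n", 100 * full / design + 0.5
            else            exit 1
        }'
}

# Possible usage, assuming megacli is installed and a BBU is present:
#   sudo megacli -AdpBbuCmd -a0 | bbu_health_pct
```

With the values from this task (183 mAh out of 1700 mAh) the helper prints 11.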

Issued cycle:

$ sudo megacli -AdpBbuCmd -BbuLearn -aALL

Here are some values during the cycle:

$ sudo megacli -AdpBbuCmd -GetBbuStatus -aALL | grep -e '^isSOHGood' -e '^Charger Status' -e '^Remaining Capacity'
Charger Status: In Progress
Remaining Capacity: 61 mAh
isSOHGood: Yes

... SNIP ...

$ sudo megacli -AdpBbuCmd -GetBbuStatus -aALL | grep -e '^isSOHGood' -e '^Charger Status' -e '^Remaining Capacity'
Charger Status: In Progress
Remaining Capacity: 110 mAh
isSOHGood: Yes

... SNIP ...

$ sudo megacli -AdpBbuCmd -GetBbuStatus -aALL | grep -e '^isSOHGood' -e '^Charger Status' -e '^Remaining Capacity'
Charger Status: In Progress
Remaining Capacity: 181 mAh
isSOHGood: Yes

... SNIP ...

$ sudo megacli -AdpBbuCmd -GetBbuStatus -aALL | grep -e '^isSOHGood' -e '^Charger Status' -e '^Remaining Capacity'
Charger Status: Off
Remaining Capacity: 16 mAh
isSOHGood: Yes
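
The samples above were taken by hand. A rough sketch of how the same filter could be run periodically during a learn cycle is shown here. The `-GetBbuStatus` call and the grep patterns are the ones used in this task, while `sample_bbu`, the `MEGACLI` variable and the timestamp format are assumptions added for illustration.

```shell
# Command used to reach the controller. Assumption: override it if megacli
# is installed under another name or sudo is not needed.
MEGACLI=${MEGACLI:-"sudo megacli"}

# Hypothetical helper: take $2 samples, $1 seconds apart, and print each
# one with a UTC timestamp followed by the three fields of interest.
sample_bbu() {
    interval=$1
    count=$2
    i=0
    while [ "$i" -lt "$count" ]; do
        date -u '+%Y-%m-%dT%H:%M:%SZ'
        $MEGACLI -AdpBbuCmd -GetBbuStatus -aALL |
            grep -e '^isSOHGood' -e '^Charger Status' -e '^Remaining Capacity'
        i=$((i + 1))
        # Sleep only between samples, not after the last one.
        if [ "$i" -lt "$count" ]; then
            sleep "$interval"
        fi
    done
}

# Possible usage: one sample every 5 minutes for an hour.
#   sample_bbu 300 12
```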

Current state:

$ sudo megacli -AdpBbuCmd -GetBbuCapacityInfo -aAll


BBU Capacity Info for Adapter: 0

  Relative State of Charge: 9 %
  Absolute State of charge: 1 %
  Remaining Capacity: 16 mAh
  Full Charge Capacity: 181 mAh
  Run time to empty: 1 Min.
  Average time to empty: 1 Min.
  Estimated Time to full recharge: Battery is not being charged.
  Cycle Count: 32
Max Error = 26 %
Remaining Capacity Alarm = 170 mAh
Remining Time Alarm = 10 Min
$ sudo megacli -AdpBbuCmd  -a0

BBU status for Adapter: 0

BatteryType: BBU
Voltage: 3640 mV
Current: -647 mA
Temperature: 33 C
Battery State: Learning
BBU Firmware Status:

  Charging Status              : Discharging
  Voltage                                 : OK
  Temperature                             : OK
  Learn Cycle Requested	                  : Yes
  Learn Cycle Active                      : Yes
  Learn Cycle Status                      : OK
  Learn Cycle Timeout                     : No
  I2c Errors Detected                     : No
  Battery Pack Missing                    : No
  Battery Replacement required            : No
  Remaining Capacity Low                  : Yes
  Periodic Learn Required                 : No
  Transparent Learn                       : No
  No space to cache offload               : No
  Pack is about to fail & should be replaced : No
  Cache Offload premium feature required  : No
  Module microcode update required        : No


GasGuageStatus:
  Fully Discharged        : No
  Fully Charged           : No
  Discharging             : Yes
  Initialized             : Yes
  Remaining Time Alarm    : Yes
  Discharge Terminated    : No
  Over Temperature        : No
  Charging Terminated     : No
  Over Charged            : No
Relative State of Charge: 9 %
Charger Status: Off
Remaining Capacity: 16 mAh
Full Charge Capacity: 181 mAh
isSOHGood: Yes
  Battery backup charge time : 0 hours

BBU Capacity Info for Adapter: 0

  Relative State of Charge: 9 %
  Absolute State of charge: 1 %
  Remaining Capacity: 16 mAh
  Full Charge Capacity: 181 mAh
  Run time to empty: 1 Min.
  Average time to empty: 1 Min.
  Estimated Time to full recharge: Battery is not being charged.
  Cycle Count: 32
Max Error = 26 %
Remaining Capacity Alarm = 170 mAh
Remining Time Alarm = 10 Min

BBU Design Info for Adapter: 0

  Date of Manufacture: 11/17, 2010
  Design Capacity: 1700 mAh
  Design Voltage: 3700 mV
  Specification Info: 33
  Serial Number: 5092
  Pack Stat Configuration: 0x0000
  Manufacture Name: SANYO
  Firmware Version   :
  Device Name: DLNU209
  Device Chemistry: LION
  Battery FRU: N/A
  Transparent Learn = 0
  App Data = 0

BBU Properties for Adapter: 0

  Auto Learn Period: 90 Days
  Next Learn time: None  Learn Delay Interval:0 Hours
  Auto-Learn Mode: Warn via Event

Yes, we have seen that behaviour before with faulty BBUs :-(

It is now showing Optimal again:

BatteryType: BBU
Voltage: 4074 mV
Current: 0 mA
Temperature: 32 C
Battery State: Optimal
BBU Firmware Status:

  Charging Status              : None
  Voltage                                 : OK
  Temperature                             : OK
  Learn Cycle Requested	                  : No

And thus the RAID is back into WB:

Default Cache Policy: WriteBack, ReadAdaptive, Direct, No Write Cache if Bad BBU
Current Cache Policy: WriteBack, ReadAdaptive, Direct, No Write Cache if Bad BBU

This went back to faulty again:

BatteryType: BBU
Battery State: Unknown
  Battery backup charge time : 0 hours

RAID went back to WriteThrough:

Default Cache Policy: WriteBack, ReadAdaptive, Direct, No Write Cache if Bad BBU
Current Cache Policy: WriteThrough, ReadAdaptive, Direct, No Write Cache if Bad BBU
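
The difference between the two lines above (Default is WriteBack, Current is WriteThrough) is what the Icinga check reports. A small sketch of extracting the current write mode from such output follows. The helper names are hypothetical, and it is an assumption that the lines come from `megacli -LDInfo -LAll -aALL`, since the task does not show the command used.

```shell
# Hypothetical helper: prints the write mode (for example WriteBack or
# WriteThrough) from every "Current Cache Policy" line read on stdin.
current_write_mode() {
    sed -n 's/^ *Current Cache Policy: *\([A-Za-z]*\),.*/\1/p'
}

# Hypothetical helper: succeeds (exit 0) when at least one logical drive
# is currently in a mode other than WriteBack. With no policy lines on
# stdin it fails, so empty input is not reported as degraded.
write_cache_degraded() {
    current_write_mode | grep -qv '^WriteBack$'
}

# Possible usage, assuming megacli is installed:
#   sudo megacli -LDInfo -LAll -aALL | write_cache_degraded && echo "not in WriteBack"
```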

Forced a relearn cycle:

root@db1016:~#  megacli -AdpBbuCmd -BbuLearn -aALL -NoLog

Adapter 0: BBU Learn Succeeded.

And it is back:

05:51 < icinga-wm> RECOVERY - MegaRAID on db1016 is OK: OK: optimal, 1 logical, 2 physical, WriteBack policy




Default Cache Policy: WriteBack, ReadAdaptive, Direct, No Write Cache if Bad BBU
Current Cache Policy: WriteBack, ReadAdaptive, Direct, No Write Cache if Bad BBU

And again:

˜/icinga-wm 9:11> PROBLEM - MegaRAID on db1016 is CRITICAL: CRITICAL: 1 LD(s) must have write cache policy WriteBack, currently using: WriteThrough

Mentioned in SAL (#wikimedia-operations) [2017-06-21T05:41:17Z] <marostegui> Start relearn BBU cycle on db1016 - T166344

After issuing the relearn, the RAID is back to WB:

˜/icinga-wm 7:48> RECOVERY - MegaRAID on db1016 is OK: OK: optimal, 1 logical, 2 physical, WriteBack policy

Mentioned in SAL (#wikimedia-operations) [2017-07-05T12:49:18Z] <marostegui> Force BBU relearn on db1016 - T166344

˜/icinga-wm 14:48> PROBLEM - MegaRAID on db1016 is CRITICAL: CRITICAL: 1 LD(s) must have write cache policy WriteBack, currently using: WriteThrough


BBU status for Adapter: 0

BatteryType: BBU
Battery State: Unknown
  Battery backup charge time : 0 hours

BBU Capacity Info for Adapter: 0

  Relative State of Charge: 27 %
  Absolute State of charge: 3 %
  Remaining Capacity: 56 mAh
  Full Charge Capacity: 208 mAh
  Run time to empty: Battery is not being charged.
  Average time to empty: Battery is not being charged.
  Estimated Time to full recharge: Battery is not being charged.
  Cycle Count: 34
Max Error = 0 %

I have forced a relearn cycle

˜/icinga-wm 14:58> RECOVERY - MegaRAID on db1016 is OK: OK: optimal, 1 logical, 2 physical, WriteBack policy

Mentioned in SAL (#wikimedia-operations) [2017-07-20T06:54:03Z] <marostegui> Force a BBU relearn on db1016 - T166344

Mentioned in SAL (#wikimedia-operations) [2017-07-20T08:34:09Z] <marostegui> Force a BBU relearn on db1016 - T166344

So for the record, after: T166344#3455435 we got:

˜/icinga-wm 9:04> RECOVERY - MegaRAID on db1016 is OK: OK: optimal, 1 logical, 2 physical, WriteBack policy

Then we got:

˜/icinga-wm 10:34> PROBLEM - MegaRAID on db1016 is CRITICAL: CRITICAL: 1 LD(s) must have write cache policy WriteBack, currently using: WriteThrough

So I forced the relearn again on: T166344#3455546
The learning is still happening on the host...

˜/icinga-wm 12:44> RECOVERY - MegaRAID on db1016 is OK: OK: optimal, 1 logical, 2 physical, WriteBack policy
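
The pattern recorded in the comments above (PROBLEM alert, manually forced relearn, RECOVERY) could be sketched as a single guarded step. This is only an illustration under stated assumptions and not what was run on db1016: `relearn_if_write_through` and the `MEGACLI` variable are hypothetical, and it is assumed that `-LDInfo -LAll -aALL` prints the "Current Cache Policy" line quoted in this task. The `-BbuLearn` call is the one used above. A relearn does not address the age of the battery, so this is a stopgap at best.

```shell
# Command used to reach the controller. Assumption: override it if megacli
# is installed under another name or sudo is not needed.
MEGACLI=${MEGACLI:-"sudo megacli"}

# Hypothetical helper: force a BBU learn cycle only when a logical drive
# currently reports WriteThrough. Otherwise do nothing.
relearn_if_write_through() {
    if $MEGACLI -LDInfo -LAll -aALL |
        grep -q '^ *Current Cache Policy: WriteThrough'; then
        $MEGACLI -AdpBbuCmd -BbuLearn -aALL -NoLog
    else
        echo "write cache policy is not WriteThrough, nothing to do"
    fi
}

# Possible usage:
#   relearn_if_write_through
```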

As we discussed, it would be a good idea to do a switchover and get rid of this host, at least as a master of m1.

I have two proposals about how we can do it.

#1

Take one of the new hosts (db1098), reclone it and promote it to m1 master. Then we can keep db1001 (no hardware issues so far) as a slave just in case, keep db1016 (faulty BBU) as a temporary slave, or keep both.
Advantages:

  • Fast solution as we only need to clone one server

Disadvantages:

  • We'd need another switchover before promoting the definitive master for m1, which will be one of the old hosts (old, but not one of those numbered below db1050, as those will be decommissioned)

#2

Take an older host from one of the existing core shards, replace it with db1098 for instance, and move that old host to become m1 master.
Example:

Take db1066 (API s1 server)
Clone db1098 from db1066
Place db1098 as an API server in s1
Clone db1066 from db1001
Switch over the m1 master from db1016 to db1066

Advantages:

  • We are placing a definitive m1 master so no need to switchover again
  • By placing a powerful API server on s1, we will probably be able to reduce from 3 API servers to 2 (the 512GB server + the existing 160GB one)

Disadvantages:

  • The above s1 slave was just an example of a server that can be moved
  • db1098's place might not be definitive, as we might need to move servers around once we deploy the multi-instance for core slaves.

The BBU is failing again, so we should try to give m1 master failover some priority amongst the other misc services.

Mentioned in SAL (#wikimedia-operations) [2017-08-07T06:20:46Z] <marostegui> Force BBU re-learn on db1016 - T166344

After forcing the relearn, this recovered:

˜/icinga-wm 8:29> RECOVERY - MegaRAID on db1016 is OK: OK: optimal, 1 logical, 2 physical, WriteBack policy

And again: ˜/icinga-wm 10:09> PROBLEM - MegaRAID on db1016 is CRITICAL: CRITICAL: 1 LD(s) must have write cache policy WriteBack, currently using: WriteThrough

Mentioned in SAL (#wikimedia-operations) [2017-08-07T08:12:01Z] <marostegui> Force BBU re-learn on db1016 - T166344

Maybe we can set up m1 on db1069?

I like that idea. I'll try to work on T166546 soon, as I am about to finish T153743.

˜/icinga-wm 12:19> RECOVERY - MegaRAID on db1016 is OK: OK: optimal, 1 logical, 2 physical, WriteBack policy

db1069 has been reused on s7, so we should probably choose db1066 instead.

db1056 will be freed up during this week - we can use it to replace this host.

We can probably clone db1056 from db1001.

This host failed again and recovered by itself:

03:16 < icinga-wm> PROBLEM - MegaRAID on db1016 is CRITICAL: CRITICAL: 1 LD(s) must have write cache policy WriteBack, currently using: WriteThrough
05:56 < icinga-wm> RECOVERY - MegaRAID on db1016 is OK: OK: optimal, 1 logical, 2 physical, WriteBack policy

Mentioned in SAL (#wikimedia-operations) [2018-01-22T13:46:27Z] <marostegui> Force BBU relearn on db1016 - T166344

This failed again - I have forced a relearn

After the relearn:

˜/icinga-wm 18:42> RECOVERY - MegaRAID on db1016 is OK: OK: optimal, 1 logical, 2 physical, WriteBack policy

Not sure it will last like that for long anyway :)

This host is no longer the master and will be decommissioned - T190179