From Icinga:
db1016 MegaRAID CRITICAL 2017-05-25 20:28:18 0d 0h 37m 17s 3/3 CRITICAL: 1 LD(s) must have write cache policy WriteBack, currently using: WriteThrough
Charging Status : Discharging
I've ack'ed the Icinga alarm with this task.
I've also forced a BBU learn cycle on db1016. It looked good during the cycle, and as soon as the battery had some charge the write policy went back to WriteBack. At the end, however, it gave up: the battery returned to its failed state and the policy went back to WriteThrough.
The battery state did change from None to Discharging, but the battery doesn't have enough capacity, being from 2010. So I'd say we'll need to replace it, unless we plan to replace the host very soon.
Before the cycle:
$ sudo megacli -AdpBbuCmd -GetBbuCapacityInfo -aAll
BBU Capacity Info for Adapter: 0
Relative State of Charge: 31 %
Absolute State of charge: 3 %
Remaining Capacity: 56 mAh
Full Charge Capacity: 183 mAh
Run time to empty: Battery is not being charged.
Average time to empty: Battery is not being charged.
Estimated Time to full recharge: Battery is not being charged.
Cycle Count: 31
Max Error = 0 %
Remaining Capacity Alarm = 170 mAh
Remining Time Alarm = 10 Min
$ sudo megacli -AdpBbuCmd -a0
BBU status for Adapter: 0
BatteryType: BBU
Voltage: 4028 mV
Current: 0 mA
Temperature: 32 C
Battery State: Degraded(Need Attention)
A manual learn is required.

BBU Firmware Status:
Charging Status : None
Voltage : OK
Temperature : OK
Learn Cycle Requested : Yes
Learn Cycle Active : No
Learn Cycle Status : OK
Learn Cycle Timeout : No
I2c Errors Detected : No
Battery Pack Missing : No
Battery Replacement required : No
Remaining Capacity Low : Yes
Periodic Learn Required : No
Transparent Learn : No
No space to cache offload : No
Pack is about to fail & should be replaced : No
Cache Offload premium feature required : No
Module microcode update required : No

GasGuageStatus:
Fully Discharged : No
Fully Charged : No
Discharging : No
Initialized : Yes
Remaining Time Alarm : No
Discharge Terminated : No
Over Temperature : No
Charging Terminated : No
Over Charged : No
Relative State of Charge: 31 %
Charger Status: In Progress
Remaining Capacity: 56 mAh
Full Charge Capacity: 183 mAh
isSOHGood: Yes
Battery backup charge time : 0 hours

BBU Capacity Info for Adapter: 0
Relative State of Charge: 32 %
Absolute State of charge: 3 %
Remaining Capacity: 59 mAh
Full Charge Capacity: 183 mAh
Run time to empty: Battery is not being charged.
Average time to empty: Battery is not being charged.
Estimated Time to full recharge: 37 Min.
Cycle Count: 31
Max Error = 0 %
Remaining Capacity Alarm = 170 mAh
Remining Time Alarm = 10 Min

BBU Design Info for Adapter: 0
Date of Manufacture: 11/17, 2010
Design Capacity: 1700 mAh
Design Voltage: 3700 mV
Specification Info: 33
Serial Number: 5092
Pack Stat Configuration: 0x0000
Manufacture Name: SANYO
Firmware Version :
Device Name: DLNU209
Device Chemistry: LION
Battery FRU: N/A
Transparent Learn = 0
App Data = 0

BBU Properties for Adapter: 0
Auto Learn Period: 90 Days
Next Learn time: None
Learn Delay Interval:0 Hours
Auto-Learn Mode: Warn via Event
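To put the capacity numbers in perspective, here is a rough sketch comparing the reported Full Charge Capacity with the Design Capacity and with the Remaining Capacity Alarm threshold. The helper name is made up for illustration, and I'm assuming (not confirmed anywhere in this task) that "Remaining Capacity Low" is raised when the remaining capacity drops below that alarm threshold.

```python
def capacity_health(full_charge_mah: float, design_mah: float) -> float:
    """Return the full-charge capacity as a percentage of the design capacity."""
    if design_mah <= 0:
        raise ValueError("design capacity must be positive")
    return 100.0 * full_charge_mah / design_mah


# Values copied from the megacli output before the learn cycle.
FULL_CHARGE_MAH = 183
DESIGN_MAH = 1700
ALARM_MAH = 170  # "Remaining Capacity Alarm"

health = capacity_health(FULL_CHARGE_MAH, DESIGN_MAH)
# Headroom between a completely full battery and the alarm threshold
# (assumption: the "low capacity" flag trips below ALARM_MAH).
headroom_mah = FULL_CHARGE_MAH - ALARM_MAH

print(f"{health:.1f}% of design capacity, {headroom_mah} mAh above the alarm when full")
```

So even fully charged the battery holds roughly a tenth of what it was designed for, and sits only a few mAh above the alarm level, which would fit with it flapping between states.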
Issued cycle:
$ sudo megacli -AdpBbuCmd -BbuLearn -aALL
Here are some values during the cycle:
$ sudo megacli -AdpBbuCmd -GetBbuStatus -aALL | grep -e '^isSOHGood' -e '^Charger Status' -e '^Remaining Capacity'
Charger Status: In Progress
Remaining Capacity: 61 mAh
isSOHGood: Yes
... SNIP ...
$ sudo megacli -AdpBbuCmd -GetBbuStatus -aALL | grep -e '^isSOHGood' -e '^Charger Status' -e '^Remaining Capacity'
Charger Status: In Progress
Remaining Capacity: 110 mAh
isSOHGood: Yes
... SNIP ...
$ sudo megacli -AdpBbuCmd -GetBbuStatus -aALL | grep -e '^isSOHGood' -e '^Charger Status' -e '^Remaining Capacity'
Charger Status: In Progress
Remaining Capacity: 181 mAh
isSOHGood: Yes
... SNIP ...
$ sudo megacli -AdpBbuCmd -GetBbuStatus -aALL | grep -e '^isSOHGood' -e '^Charger Status' -e '^Remaining Capacity'
Charger Status: Off
Remaining Capacity: 16 mAh
isSOHGood: Yes
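For anyone who wants to watch those three fields during a future learn cycle without rerunning the grep by hand, a minimal sketch follows. It runs the same `megacli -AdpBbuCmd -GetBbuStatus -aALL` command; the function names, the polling interval and the sample count are hypothetical choices of mine, not anything we have deployed.

```python
import subprocess
import time

# The same three fields the grep above selects.
FIELDS = ("Charger Status", "Remaining Capacity", "isSOHGood")


def parse_bbu_fields(text, fields=FIELDS):
    """Extract selected 'Key: Value' fields from megacli BBU status output."""
    result = {}
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        # Exact key match, so "Remaining Capacity Low" is not picked up.
        if sep and key.strip() in fields:
            result[key.strip()] = value.strip()
    return result


def poll(interval_s=300, count=3):
    """Print the selected fields every interval_s seconds, count times."""
    cmd = ["sudo", "megacli", "-AdpBbuCmd", "-GetBbuStatus", "-aALL"]
    for _ in range(count):
        out = subprocess.run(cmd, capture_output=True, text=True, check=True).stdout
        print(parse_bbu_fields(out))
        time.sleep(interval_s)
```

Calling `poll()` on the host would print one small dict per sample; `parse_bbu_fields` can also be fed pasted output directly.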
Current state:
$ sudo megacli -AdpBbuCmd -GetBbuCapacityInfo -aAll
BBU Capacity Info for Adapter: 0
Relative State of Charge: 9 %
Absolute State of charge: 1 %
Remaining Capacity: 16 mAh
Full Charge Capacity: 181 mAh
Run time to empty: 1 Min.
Average time to empty: 1 Min.
Estimated Time to full recharge: Battery is not being charged.
Cycle Count: 32
Max Error = 26 %
Remaining Capacity Alarm = 170 mAh
Remining Time Alarm = 10 Min
$ sudo megacli -AdpBbuCmd -a0
BBU status for Adapter: 0
BatteryType: BBU
Voltage: 3640 mV
Current: -647 mA
Temperature: 33 C
Battery State: Learning

BBU Firmware Status:
Charging Status : Discharging
Voltage : OK
Temperature : OK
Learn Cycle Requested : Yes
Learn Cycle Active : Yes
Learn Cycle Status : OK
Learn Cycle Timeout : No
I2c Errors Detected : No
Battery Pack Missing : No
Battery Replacement required : No
Remaining Capacity Low : Yes
Periodic Learn Required : No
Transparent Learn : No
No space to cache offload : No
Pack is about to fail & should be replaced : No
Cache Offload premium feature required : No
Module microcode update required : No

GasGuageStatus:
Fully Discharged : No
Fully Charged : No
Discharging : Yes
Initialized : Yes
Remaining Time Alarm : Yes
Discharge Terminated : No
Over Temperature : No
Charging Terminated : No
Over Charged : No
Relative State of Charge: 9 %
Charger Status: Off
Remaining Capacity: 16 mAh
Full Charge Capacity: 181 mAh
isSOHGood: Yes
Battery backup charge time : 0 hours

BBU Capacity Info for Adapter: 0
Relative State of Charge: 9 %
Absolute State of charge: 1 %
Remaining Capacity: 16 mAh
Full Charge Capacity: 181 mAh
Run time to empty: 1 Min.
Average time to empty: 1 Min.
Estimated Time to full recharge: Battery is not being charged.
Cycle Count: 32
Max Error = 26 %
Remaining Capacity Alarm = 170 mAh
Remining Time Alarm = 10 Min

BBU Design Info for Adapter: 0
Date of Manufacture: 11/17, 2010
Design Capacity: 1700 mAh
Design Voltage: 3700 mV
Specification Info: 33
Serial Number: 5092
Pack Stat Configuration: 0x0000
Manufacture Name: SANYO
Firmware Version :
Device Name: DLNU209
Device Chemistry: LION
Battery FRU: N/A
Transparent Learn = 0
App Data = 0

BBU Properties for Adapter: 0
Auto Learn Period: 90 Days
Next Learn time: None
Learn Delay Interval:0 Hours
Auto-Learn Mode: Warn via Event
It is now showing Optimal again:
BatteryType: BBU
Voltage: 4074 mV
Current: 0 mA
Temperature: 32 C
Battery State: Optimal
BBU Firmware Status:
Charging Status : None
Voltage : OK
Temperature : OK
Learn Cycle Requested : No
And thus the RAID is back into WB:
Default Cache Policy: WriteBack, ReadAdaptive, Direct, No Write Cache if Bad BBU
Current Cache Policy: WriteBack, ReadAdaptive, Direct, No Write Cache if Bad BBU
This went back to faulty again:
BatteryType: BBU
Battery State: Unknown
Battery backup charge time : 0 hours
The RAID went back to WriteThrough:
Default Cache Policy: WriteBack, ReadAdaptive, Direct, No Write Cache if Bad BBU
Current Cache Policy: WriteThrough, ReadAdaptive, Direct, No Write Cache if Bad BBU
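The fallback is visible by comparing the first item of the Default and Current Cache Policy lines. A small sketch of that comparison, in case it is useful for scripting; the function names are hypothetical, and I'm assuming the text comes from something like `megacli -LDInfo -Lall -aALL` on a controller with a single logical drive (as on this host), since the exact command isn't quoted in this task.

```python
import re


def write_policies(ldinfo_text):
    """Map 'Default' and 'Current' to the write policy, i.e. the first
    item on each 'Cache Policy' line. With several logical drives the
    last one wins, so this is only meant for single-LD hosts."""
    pairs = re.findall(r"(Default|Current) Cache Policy:\s*(\w+)", ldinfo_text)
    return dict(pairs)


def fell_back_to_writethrough(ldinfo_text):
    """True when the default is WriteBack but the controller is not using it."""
    policies = write_policies(ldinfo_text)
    return policies.get("Default") == "WriteBack" and policies.get("Current") != "WriteBack"


failed = (
    "Default Cache Policy: WriteBack, ReadAdaptive, Direct, No Write Cache if Bad BBU\n"
    "Current Cache Policy: WriteThrough, ReadAdaptive, Direct, No Write Cache if Bad BBU\n"
)
print(write_policies(failed), fell_back_to_writethrough(failed))
```

This is essentially the condition the Icinga check reports as "must have write cache policy WriteBack, currently using: WriteThrough".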
Forced a relearn cycle:
root@db1016:~# megacli -AdpBbuCmd -BbuLearn -aALL -NoLog
Adapter 0: BBU Learn Succeeded.
And it is back:
05:51 < icinga-wm> RECOVERY - MegaRAID on db1016 is OK: OK: optimal, 1 logical, 2 physical, WriteBack policy
Default Cache Policy: WriteBack, ReadAdaptive, Direct, No Write Cache if Bad BBU
Current Cache Policy: WriteBack, ReadAdaptive, Direct, No Write Cache if Bad BBU
And again:
˜/icinga-wm 9:11> PROBLEM - MegaRAID on db1016 is CRITICAL: CRITICAL: 1 LD(s) must have write cache policy WriteBack, currently using: WriteThrough
Mentioned in SAL (#wikimedia-operations) [2017-06-21T05:41:17Z] <marostegui> Start relearn BBU cycle on db1016 - T166344
After issuing the relearn, the RAID is back to WB:
˜/icinga-wm 7:48> RECOVERY - MegaRAID on db1016 is OK: OK: optimal, 1 logical, 2 physical, WriteBack policy
Mentioned in SAL (#wikimedia-operations) [2017-07-05T12:49:18Z] <marostegui> Force BBU relearn on db1016 - T166344
˜/icinga-wm 14:48> PROBLEM - MegaRAID on db1016 is CRITICAL: CRITICAL: 1 LD(s) must have write cache policy WriteBack, currently using: WriteThrough
BBU status for Adapter: 0
BatteryType: BBU
Battery State: Unknown
Battery backup charge time : 0 hours
BBU Capacity Info for Adapter: 0
Relative State of Charge: 27 %
Absolute State of charge: 3 %
Remaining Capacity: 56 mAh
Full Charge Capacity: 208 mAh
Run time to empty: Battery is not being charged.
Average time to empty: Battery is not being charged.
Estimated Time to full recharge: Battery is not being charged.
Cycle Count: 34
Max Error = 0 %
I have forced a relearn cycle:
˜/icinga-wm 14:58> RECOVERY - MegaRAID on db1016 is OK: OK: optimal, 1 logical, 2 physical, WriteBack policy
Mentioned in SAL (#wikimedia-operations) [2017-07-20T06:54:03Z] <marostegui> Force a BBU relearn on db1016 - T166344
Mentioned in SAL (#wikimedia-operations) [2017-07-20T08:34:09Z] <marostegui> Force a BBU relearn on db1016 - T166344
So for the record, after: T166344#3455435 we got:
˜/icinga-wm 9:04> RECOVERY - MegaRAID on db1016 is OK: OK: optimal, 1 logical, 2 physical, WriteBack policy
Then we got:
˜/icinga-wm 10:34> PROBLEM - MegaRAID on db1016 is CRITICAL: CRITICAL: 1 LD(s) must have write cache policy WriteBack, currently using: WriteThrough
So I forced the relearn again on: T166344#3455546
The learning is still happening on the host...
˜/icinga-wm 12:44> RECOVERY - MegaRAID on db1016 is OK: OK: optimal, 1 logical, 2 physical, WriteBack policy
As we discussed, it would be a good idea to do a switchover and get rid of this host, at least as a master of m1.
I have two proposals about how we can do it.
#1
Take one of the new hosts (db1098), reclone it and promote it to m1 master. Then we can keep db1001 (no hw issues so far) as a slave just in case, keep db1016 (faulty BBU) as a temporary slave, or keep both.
Advantages:
Cons:
#2
Take an older host from one of the existing core shards, replace it with db1098 for instance, and move that old host to become m1 master.
Example:
Take db1066 (API s1 server)
Clone db1098 from db1066
Place db1098 as an API server in s1
Clone db1066 from db1001
Switchover m1 master db1016 to db1066
Advantages:
Disadvantages:
The BBU is failing again, so we should try to give m1 master failover some priority amongst the other misc services.
Mentioned in SAL (#wikimedia-operations) [2017-08-07T06:20:46Z] <marostegui> Force BBU re-learn on db1016 - T166344
After forcing the relearn, this recovered:
˜/icinga-wm 8:29> RECOVERY - MegaRAID on db1016 is OK: OK: optimal, 1 logical, 2 physical, WriteBack policy
And again:
˜/icinga-wm 10:09> PROBLEM - MegaRAID on db1016 is CRITICAL: CRITICAL: 1 LD(s) must have write cache policy WriteBack, currently using: WriteThrough
Mentioned in SAL (#wikimedia-operations) [2017-08-07T08:12:01Z] <marostegui> Force BBU re-learn on db1016 - T166344
˜/icinga-wm 12:19> RECOVERY - MegaRAID on db1016 is OK: OK: optimal, 1 logical, 2 physical, WriteBack policy
This host failed again and recovered itself:
03:16 < icinga-wm> PROBLEM - MegaRAID on db1016 is CRITICAL: CRITICAL: 1 LD(s) must have write cache policy WriteBack, currently using: WriteThrough
05:56 < icinga-wm> RECOVERY - MegaRAID on db1016 is OK: OK: optimal, 1 logical, 2 physical, WriteBack policy
Mentioned in SAL (#wikimedia-operations) [2018-01-22T13:46:27Z] <marostegui> Force BBU relearn on db1016 - T166344
After the relearn:
˜/icinga-wm 18:42> RECOVERY - MegaRAID on db1016 is OK: OK: optimal, 1 logical, 2 physical, WriteBack policy
Not sure it will last like that for long anyways :)