root@db1055:~$ megacli -AdpBbuCmd -a0
BBU status for Adapter: 0
BatteryType: BBU
Battery State: Unknown
Battery backup charge time : 0 hours
BBU Capacity Info for Adapter: 0
Relative State of Charge: 17 %
Absolute State of charge: 0 %
Remaining Capacity: 94 mAh
Full Charge Capacity: 577 mAh
Run time to empty: Battery is not being charged.
Average time to empty: 8 Min.
Estimated Time to full recharge: Battery is not being charged.
Cycle Count: 5
Max Error = 0 %
Remaining Capacity Alarm = 0 mAh
Remining Time Alarm = 0 Min
BBU Design Info for Adapter: 0
Date of Manufacture: 07/18, 2011
Design Capacity: 90 mAh
Design Voltage: 0 mV
Specification Info: 0
Serial Number: 0
Pack Stat Configuration: 0x0000
Manufacture Name:
Firmware Version : 0148 03
Device Name:
Device Chemistry:
Battery FRU: N/A
Module Version = 0148 03
Transparent Learn = 1
App Data = 1
BBU Properties for Adapter: 0
Auto Learn Period: 90 Days
Next Learn time: None Learn Delay Interval:0 Hours
Auto-Learn Mode: Disabled
Exit Code: 0x00
Mentioned in SAL (#wikimedia-operations) [2017-08-27T05:30:53Z] <marostegui> Force BBU relearn on db1055 - T174265
I will force a re-learn cycle on this host to see if the BBU comes back to optimal.
@Cmjohnson, if the re-learn doesn't work, can we replace it with a BBU from one of the servers that are ready to be decommissioned? I don't think this server is still under warranty, right? (racktables doesn't say so)
After the re-learn the BBU is back to Optimal and the RAID back to WB:
root@db1055:~# megacli -AdpBbuCmd -a0
BBU status for Adapter: 0
BatteryType: BBU
Voltage: 3438 mV
Current: 0 mA
Temperature: 48 C
Battery State: Optimal
BBU Firmware Status:
root@db1055:~# megacli -ldinfo -l0 -a0 | grep Policy
Default Cache Policy: WriteBack, ReadAheadNone, Direct, No Write Cache if Bad BBU
Current Cache Policy: WriteBack, ReadAheadNone, Direct, No Write Cache if Bad BBU
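The two checks above (battery state and current cache policy) could be scripted along these lines. This is only a minimal sketch: it assumes the `megacli` output has already been captured as text, and the helper names `field`, `battery_state`, `current_cache_policy` and `looks_healthy` are hypothetical, not part of any existing tooling.

```python
import re
from typing import List, Optional


def field(output: str, name: str) -> Optional[str]:
    """Return the value of the first 'name: value' line, or None if absent."""
    pattern = r"^[ \t]*" + re.escape(name) + r"[ \t]*:[ \t]*(.+?)[ \t]*$"
    match = re.search(pattern, output, re.MULTILINE)
    return match.group(1) if match else None


def battery_state(bbu_output: str) -> Optional[str]:
    """Extract the 'Battery State' value from 'megacli -AdpBbuCmd' output."""
    return field(bbu_output, "Battery State")


def current_cache_policy(ldinfo_output: str) -> List[str]:
    """Split the 'Current Cache Policy' line into its comma-separated parts."""
    value = field(ldinfo_output, "Current Cache Policy")
    return [part.strip() for part in value.split(",")] if value else []


def looks_healthy(bbu_output: str, ldinfo_output: str) -> bool:
    """Assumed health rule: battery is Optimal and the volume is in WriteBack."""
    return (
        battery_state(bbu_output) == "Optimal"
        and "WriteBack" in current_cache_policy(ldinfo_output)
    )
```

Fed the output pasted above, `looks_healthy` would report the host as healthy; with a `Degraded(Need Attention)` battery state it would not.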
This happened again; we definitely need to replace the BBU. cc @Cmjohnson
root@db1055:~# megacli -AdpBbuCmd -a0
BBU status for Adapter: 0
BatteryType: BBU
Voltage: 3481 mV
Current: 0 mA
Temperature: 48 C
Battery State: Degraded(Need Attention)
A manual learn is required.
Mentioned in SAL (#wikimedia-operations) [2017-08-28T07:22:13Z] <marostegui> Force re-learn cycle on db1055 - https://phabricator.wikimedia.org/T174265
After forcing the re-learn again:
˜/icinga-wm 9:34> RECOVERY - MegaRAID on db1055 is OK: OK: optimal, 1 logical, 2 physical, WriteBack policy
Let's just replace the BBU anyway.
Mentioned in SAL (#wikimedia-operations) [2017-08-29T14:33:28Z] <marostegui> Shutdown db1055 to replace its BBU - T174265
The BBU has been replaced and looks good:
root@db1055:/home/marostegui# megacli -AdpBbuCmd -a0
BBU status for Adapter: 0
BatteryType: BBU
Voltage: 3918 mV
Current: 0 mA
Temperature: 42 C
Battery State: Optimal
BBU Firmware Status:
Charging Status : None
Voltage : OK
Temperature : OK
Learn Cycle Requested : Yes
Learn Cycle Active : Yes
Learn Cycle Status : OK
Learn Cycle Timeout : No
I2c Errors Detected : No
Battery Pack Missing : No
Battery Replacement required : No
Remaining Capacity Low : No
Periodic Learn Required : No
Transparent Learn : Yes
No space to cache offload : No
Pack is about to fail & should be replaced : No
Cache Offload premium feature required : No
Module microcode update required : No
BBU GasGauge Status: 0x0238
Relative State of Charge: 100 %
Charger Status: Off
Remaining Capacity: 480 mAh
Full Charge Capacity: 480 mAh
isSOHGood: Yes
Battery backup charge time : 0 hours
BBU Capacity Info for Adapter: 0
Relative State of Charge: 100 %
Absolute State of charge: 0 %
Remaining Capacity: 480 mAh
Full Charge Capacity: 480 mAh
Run time to empty: Battery is not being charged.
Average time to empty: 38 Min.
Estimated Time to full recharge: Battery is not being charged.
Cycle Count: 3
Max Error = 0 %
Remaining Capacity Alarm = 0 mAh
Remining Time Alarm = 0 Min
BBU Design Info for Adapter: 0
Date of Manufacture: 07/18, 2011
Design Capacity: 90 mAh
Design Voltage: 0 mV
Specification Info: 0
Serial Number: 0
Pack Stat Configuration: 0x0000
Manufacture Name:
Firmware Version : 0148 03
Device Name:
Device Chemistry:
Battery FRU: N/A
Module Version = 0148 03
Transparent Learn = 1
App Data = 0
BBU Properties for Adapter: 0
Auto Learn Period: 90 Days
Next Learn time: None Learn Delay Interval:0 Hours
Auto-Learn Mode: Disabled
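For a quick look at the firmware status flags in output like the above, something like the following could be used. It is a rough sketch: the selection of flags in `ALARM_FLAGS` is an assumption about which ones matter for a replacement decision, and `parse_fields` and `raised_alarms` are hypothetical helper names.

```python
from typing import Dict, List

# Assumed set of flags that should read "No" on a healthy BBU
# (names copied from the megacli output, selection is a guess).
ALARM_FLAGS = (
    "Battery Pack Missing",
    "Battery Replacement required",
    "Remaining Capacity Low",
    "Pack is about to fail & should be replaced",
)


def parse_fields(output: str) -> Dict[str, str]:
    """Parse 'key : value' lines into a dict, keeping the first value per key.

    Some keys (e.g. 'Voltage') appear more than once in the output; only
    the first occurrence is kept in this sketch.
    """
    fields: Dict[str, str] = {}
    for line in output.splitlines():
        key, sep, value = line.partition(":")
        if sep:
            fields.setdefault(key.strip(), value.strip())
    return fields


def raised_alarms(output: str) -> List[str]:
    """Return the alarm flags that are reported as 'Yes'."""
    fields = parse_fields(output)
    return [flag for flag in ALARM_FLAGS if fields.get(flag) == "Yes"]
```

On the output of the replaced BBU this would return an empty list, since all four flags read `No`.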
The RAID looks good:
root@db1055:/home/marostegui# megacli -ldinfo -l0 -a0 | grep Policy
Default Cache Policy: WriteBack, ReadAheadNone, Direct, No Write Cache if Bad BBU
Current Cache Policy: WriteBack, ReadAheadNone, Direct, No Write Cache if Bad BBU
I have started MySQL but will not repool it today. I am going to leave it running overnight to see how the BBU behaves, and will start giving it some weight tomorrow if all goes well.
Thanks a lot @Cmjohnson
As this is still looking good after the whole night, I am going to start slowly repooling it.
Change 374701 had a related patch set uploaded (by Marostegui; owner: Marostegui):
[operations/mediawiki-config@master] db-eqiad.php: Slowly repool db1055
Change 374701 merged by jenkins-bot:
[operations/mediawiki-config@master] db-eqiad.php: Slowly repool db1055
Mentioned in SAL (#wikimedia-operations) [2017-08-30T07:12:24Z] <marostegui@tin> Synchronized wmf-config/db-eqiad.php: Slowly repool db1055 - T174265 (duration: 00m 52s)
Change 374709 had a related patch set uploaded (by Marostegui; owner: Marostegui):
[operations/mediawiki-config@master] db-eqiad.php: Increase weight on db1055
Change 374709 merged by jenkins-bot:
[operations/mediawiki-config@master] db-eqiad.php: Increase weight on db1055
Mentioned in SAL (#wikimedia-operations) [2017-08-30T08:06:19Z] <marostegui@tin> Synchronized wmf-config/db-eqiad.php: Give db1055 more traffic - T174265 (duration: 00m 47s)
Change 374741 had a related patch set uploaded (by Marostegui; owner: Marostegui):
[operations/mediawiki-config@master] db-eqiad.php: Give more weight to db1055
Change 374741 merged by jenkins-bot:
[operations/mediawiki-config@master] db-eqiad.php: Give more weight to db1055
Mentioned in SAL (#wikimedia-operations) [2017-08-30T08:51:34Z] <marostegui@tin> Synchronized wmf-config/db-eqiad.php: Give db1055 more traffic - T174265 (duration: 00m 48s)
Change 374777 had a related patch set uploaded (by Marostegui; owner: Marostegui):
[operations/mediawiki-config@master] db-eqiad.php: Give more weight to db1055
Change 374777 merged by Marostegui:
[operations/mediawiki-config@master] db-eqiad.php: Give more weight to db1055
Mentioned in SAL (#wikimedia-operations) [2017-08-30T09:49:21Z] <marostegui@tin> Synchronized wmf-config/db-eqiad.php: Give db1055 more traffic - T174265 (duration: 00m 47s)
Change 374783 had a related patch set uploaded (by Marostegui; owner: Marostegui):
[operations/mediawiki-config@master] db-eqiad.php: Give db1055 more traffic
Change 374783 merged by jenkins-bot:
[operations/mediawiki-config@master] db-eqiad.php: Give db1055 more traffic
Mentioned in SAL (#wikimedia-operations) [2017-08-30T10:24:42Z] <marostegui@tin> Synchronized wmf-config/db-eqiad.php: Give db1055 more traffic - T174265 (duration: 00m 47s)
Change 374787 had a related patch set uploaded (by Marostegui; owner: Marostegui):
[operations/mediawiki-config@master] db-eqiad.php: Repool db1055 with full weight
Change 374787 merged by jenkins-bot:
[operations/mediawiki-config@master] db-eqiad.php: Repool db1055 with full weight
Mentioned in SAL (#wikimedia-operations) [2017-08-30T11:06:10Z] <marostegui@tin> Synchronized wmf-config/db-eqiad.php: Restore db1055 original weight - T174265 (duration: 00m 46s)
The original weight values have been set now.
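The gradual repool above can be pictured as a simple weight ramp. The sketch below is illustrative only: the actual weights used in `db-eqiad.php` are not recorded in this task, so the fractions and the function name `repool_steps` are assumptions.

```python
from typing import List, Sequence


def repool_steps(
    full_weight: int,
    fractions: Sequence[float] = (0.1, 0.25, 0.5, 0.75, 1.0),  # hypothetical ramp
) -> List[int]:
    """Return strictly increasing weights leading up to full_weight.

    Each step would correspond to one config change followed by a
    period of watching the host before moving on to the next one.
    """
    steps: List[int] = []
    for fraction in fractions:
        weight = max(1, round(full_weight * fraction))
        # Skip steps that would not actually increase the weight.
        if not steps or weight > steps[-1]:
            steps.append(weight)
    return steps
```

For example, `repool_steps(100)` yields the sequence 10, 25, 50, 75, 100, with the last step restoring the original weight.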
I will close this for now.
Thanks @Cmjohnson for helping out so fast!