BBU issues on db1055, RAID cache on WriteThrough
Closed, Resolved, Public

Description

root@db1055:~$ megacli -AdpBbuCmd  -a0
                                     
BBU status for Adapter: 0

BatteryType: BBU
Battery State: Unknown
  Battery backup charge time : 0 hours

BBU Capacity Info for Adapter: 0

  Relative State of Charge: 17 %
  Absolute State of charge: 0 %
  Remaining Capacity: 94 mAh
  Full Charge Capacity: 577 mAh
  Run time to empty: Battery is not being charged.  
  Average time to empty: 8 Min. 
  Estimated Time to full recharge: Battery is not being charged.  
  Cycle Count: 5
Max Error = 0 %
Remaining Capacity Alarm = 0 mAh
Remining Time Alarm = 0 Min

BBU Design Info for Adapter: 0

  Date of Manufacture: 07/18, 2011
  Design Capacity: 90 mAh
  Design Voltage: 0 mV
  Specification Info: 0
  Serial Number: 0
  Pack Stat Configuration: 0x0000
  Manufacture Name: 
  Firmware Version   : 0148 03
  Device Name: 
  Device Chemistry: 
  Battery FRU: N/A
Module Version = 0148 03
  Transparent Learn = 1
  App Data = 1

BBU Properties for Adapter: 0

  Auto Learn Period: 90 Days
  Next Learn time: None  Learn Delay Interval:0 Hours
  Auto-Learn Mode: Disabled

Exit Code: 0x00
jcrespo created this task. Aug 27 2017, 12:49 AM
Restricted Application added a project: Operations. Aug 27 2017, 12:49 AM
Restricted Application added a subscriber: Aklapper.

Mentioned in SAL (#wikimedia-operations) [2017-08-27T05:30:53Z] <marostegui> Force BBU relearn on db1055 - T174265

I will force a re-learn cycle on this host to see if the BBU comes back to Optimal.
Anyhow, @Cmjohnson, if that doesn't work, can we replace it with a BBU from one of the servers that are ready to be decommissioned? I don't think this server is under warranty anymore, right? (racktables doesn't say so)
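
For reference, a minimal sketch of how a learn cycle could be forced. The ticket does not record the exact command that was run; `-AdpBbuCmd -BbuLearn` is the MegaCli learn-cycle option as far as I know, and the `MEGACLI` and `DRY_RUN` variables and both function names are illustrative assumptions, not part of MegaCli.

```shell
#!/bin/sh
# Sketch only: the exact command used on db1055 is not shown in this task.
# MEGACLI and DRY_RUN are hypothetical helper variables.
MEGACLI="${MEGACLI:-megacli}"

# Print the command line that would start a BBU learn cycle on an adapter.
bbu_relearn_cmd() {
    adapter="${1:-0}"
    printf '%s -AdpBbuCmd -BbuLearn -a%s\n' "$MEGACLI" "$adapter"
}

# Start the learn cycle, or only print the command when DRY_RUN=1.
force_bbu_relearn() {
    cmd="$(bbu_relearn_cmd "$1")"
    if [ "${DRY_RUN:-0}" = "1" ]; then
        echo "would run: $cmd"
    else
        # Intentionally unquoted so the string splits into command + arguments.
        $cmd
    fi
}
```

Running `DRY_RUN=1 force_bbu_relearn 0` only prints the command, which is a safe way to double-check the adapter number before touching the controller.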

After the re-learn the BBU is back to Optimal and the RAID back to WB:

root@db1055:~#  megacli -AdpBbuCmd  -a0

BBU status for Adapter: 0

BatteryType: BBU
Voltage: 3438 mV
Current: 0 mA
Temperature: 48 C
Battery State: Optimal
BBU Firmware Status:



root@db1055:~# megacli -ldinfo -l0 -a0 | grep Policy
Default Cache Policy: WriteBack, ReadAheadNone, Direct, No Write Cache if Bad BBU
Current Cache Policy: WriteBack, ReadAheadNone, Direct, No Write Cache if Bad BBU
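
With "No Write Cache if Bad BBU" set, the Default policy can stay WriteBack while the Current policy drops to WriteThrough, so the Current line is the one to check. A small sketch that extracts it from the `-ldinfo` output above; the function names are made up for illustration.

```shell
#!/bin/sh
# Print the write policy (first comma-separated field) from the
# "Current Cache Policy" line of `megacli -ldinfo` output on stdin.
current_write_policy() {
    sed -n 's/^Current Cache Policy: *\([^,]*\),.*/\1/p' | head -n 1
}

# Succeed only when the current write policy is WriteBack.
is_writeback() {
    [ "$(current_write_policy)" = "WriteBack" ]
}
```

Possible usage: `megacli -ldinfo -l0 -a0 | is_writeback || echo "db1055 is not on WriteBack"`.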

This happened again; we definitely need to change the BBU. //cc @Cmjohnson

root@db1055:~# megacli -AdpBbuCmd  -a0

BBU status for Adapter: 0

BatteryType: BBU
Voltage: 3481 mV
Current: 0 mA
Temperature: 48 C
Battery State: Degraded(Need Attention)
		A manual learn is required.
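
A rough sketch of how the "Battery State" line could be pulled out of this output to flag anything that is not Optimal. The function names are hypothetical, and treating a missing state line as a problem is my own assumption.

```shell
#!/bin/sh
# Print the value of the first "Battery State:" line from
# `megacli -AdpBbuCmd` output read on stdin.
bbu_state() {
    sed -n 's/^Battery State: *//p' | head -n 1
}

# Succeed (exit 0) when the BBU needs attention, i.e. the state is
# anything other than Optimal, including a missing state line.
bbu_needs_attention() {
    case "$(bbu_state)" in
        Optimal*) return 1 ;;
        *) return 0 ;;
    esac
}
```

Possible usage: `megacli -AdpBbuCmd -a0 | bbu_needs_attention && echo "BBU on db1055 needs attention"`.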

Mentioned in SAL (#wikimedia-operations) [2017-08-28T07:22:13Z] <marostegui> Force re-learn cycle on db1055 - https://phabricator.wikimedia.org/T174265

After forcing the re-learn again:

9:34 <icinga-wm> RECOVERY - MegaRAID on db1055 is OK: OK: optimal, 1 logical, 2 physical, WriteBack policy

Let's just replace the BBU anyways.

Marostegui triaged this task as High priority. Aug 28 2017, 7:50 AM

And failed again.
Let's not spend more time on this and just replace it.

Cmjohnson moved this task from Backlog to Being worked on on the ops-eqiad board. Aug 28 2017, 8:41 PM

Mentioned in SAL (#wikimedia-operations) [2017-08-29T14:33:28Z] <marostegui> Shutdown db1055 to replace its BBU - T174265

Marostegui moved this task from Triage to In progress on the DBA board. Aug 29 2017, 3:03 PM

The BBU has been replaced and looks good:

root@db1055:/home/marostegui# megacli -AdpBbuCmd  -a0

BBU status for Adapter: 0

BatteryType: BBU
Voltage: 3918 mV
Current: 0 mA
Temperature: 42 C
Battery State: Optimal
BBU Firmware Status:

  Charging Status              : None
  Voltage                                 : OK
  Temperature                             : OK
  Learn Cycle Requested	                  : Yes
  Learn Cycle Active                      : Yes
  Learn Cycle Status                      : OK
  Learn Cycle Timeout                     : No
  I2c Errors Detected                     : No
  Battery Pack Missing                    : No
  Battery Replacement required            : No
  Remaining Capacity Low                  : No
  Periodic Learn Required                 : No
  Transparent Learn                       : Yes
  No space to cache offload               : No
  Pack is about to fail & should be replaced : No
  Cache Offload premium feature required  : No
  Module microcode update required        : No

BBU GasGauge Status: 0x0238
Relative State of Charge: 100 %
Charger Status: Off
Remaining Capacity: 480 mAh
Full Charge Capacity: 480 mAh
isSOHGood: Yes
  Battery backup charge time : 0 hours

BBU Capacity Info for Adapter: 0

  Relative State of Charge: 100 %
  Absolute State of charge: 0 %
  Remaining Capacity: 480 mAh
  Full Charge Capacity: 480 mAh
  Run time to empty: Battery is not being charged.
  Average time to empty: 38 Min.
  Estimated Time to full recharge: Battery is not being charged.
  Cycle Count: 3
Max Error = 0 %
Remaining Capacity Alarm = 0 mAh
Remining Time Alarm = 0 Min

BBU Design Info for Adapter: 0

  Date of Manufacture: 07/18, 2011
  Design Capacity: 90 mAh
  Design Voltage: 0 mV
  Specification Info: 0
  Serial Number: 0
  Pack Stat Configuration: 0x0000
  Manufacture Name:
  Firmware Version   : 0148 03
  Device Name:
  Device Chemistry:
  Battery FRU: N/A
Module Version = 0148 03
  Transparent Learn = 1
  App Data = 0

BBU Properties for Adapter: 0

  Auto Learn Period: 90 Days
  Next Learn time: None  Learn Delay Interval:0 Hours
  Auto-Learn Mode: Disabled

The RAID looks good:

root@db1055:/home/marostegui# megacli -ldinfo -l0 -a0 | grep Policy
Default Cache Policy: WriteBack, ReadAheadNone, Direct, No Write Cache if Bad BBU
Current Cache Policy: WriteBack, ReadAheadNone, Direct, No Write Cache if Bad BBU

I have started MySQL but will not repool it today. I am going to leave it running overnight to see how the BBU behaves, and will start giving it some weight tomorrow if all goes fine.
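
For the overnight check, something like the following could log the battery state at intervals. This is only a sketch: `BBU_STATUS_CMD` and `watch_bbu` are hypothetical names, and the sample count and interval are whatever you choose, not values taken from this task.

```shell
#!/bin/sh
# Command that prints the BBU status; overridable for testing (assumption).
BBU_STATUS_CMD="${BBU_STATUS_CMD:-megacli -AdpBbuCmd -a0}"

# Sample the battery state N times, sleeping between samples.
# Logs each sample with a UTC timestamp and exits non-zero
# if any sample was not "Optimal".
watch_bbu() {
    samples="$1"
    interval="$2"
    bad=0
    i=0
    while [ "$i" -lt "$samples" ]; do
        state="$($BBU_STATUS_CMD | sed -n 's/^Battery State: *//p' | head -n 1)"
        echo "$(date -u +%Y-%m-%dT%H:%M:%SZ) Battery State: ${state:-unknown}"
        [ "$state" = "Optimal" ] || bad=$((bad + 1))
        i=$((i + 1))
        # Skip the sleep after the final sample.
        [ "$i" -lt "$samples" ] && sleep "$interval"
    done
    [ "$bad" -eq 0 ]
}
```

For example, `watch_bbu 48 900` would take a sample every 15 minutes for roughly 12 hours.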

Thanks a lot @Cmjohnson

This still looks good after the whole night, so I am going to start slowly repooling it.

Change 374701 had a related patch set uploaded (by Marostegui; owner: Marostegui):
[operations/mediawiki-config@master] db-eqiad.php: Slowly repool db1055

https://gerrit.wikimedia.org/r/374701

Change 374701 merged by jenkins-bot:
[operations/mediawiki-config@master] db-eqiad.php: Slowly repool db1055

https://gerrit.wikimedia.org/r/374701

Mentioned in SAL (#wikimedia-operations) [2017-08-30T07:12:24Z] <marostegui@tin> Synchronized wmf-config/db-eqiad.php: Slowly repool db1055 - T174265 (duration: 00m 52s)

Change 374709 had a related patch set uploaded (by Marostegui; owner: Marostegui):
[operations/mediawiki-config@master] db-eqiad.php: Increase weight on db1055

https://gerrit.wikimedia.org/r/374709

Change 374709 merged by jenkins-bot:
[operations/mediawiki-config@master] db-eqiad.php: Increase weight on db1055

https://gerrit.wikimedia.org/r/374709

Mentioned in SAL (#wikimedia-operations) [2017-08-30T08:06:19Z] <marostegui@tin> Synchronized wmf-config/db-eqiad.php: Give db1055 more traffic - T174265 (duration: 00m 47s)

Change 374741 had a related patch set uploaded (by Marostegui; owner: Marostegui):
[operations/mediawiki-config@master] db-eqiad.php: Give more weight to db1055

https://gerrit.wikimedia.org/r/374741

Change 374741 merged by jenkins-bot:
[operations/mediawiki-config@master] db-eqiad.php: Give more weight to db1055

https://gerrit.wikimedia.org/r/374741

Mentioned in SAL (#wikimedia-operations) [2017-08-30T08:51:34Z] <marostegui@tin> Synchronized wmf-config/db-eqiad.php: Give db1055 more traffic - T174265 (duration: 00m 48s)

Change 374777 had a related patch set uploaded (by Marostegui; owner: Marostegui):
[operations/mediawiki-config@master] db-eqiad.php: Give more weight to db1055

https://gerrit.wikimedia.org/r/374777

Change 374777 merged by Marostegui:
[operations/mediawiki-config@master] db-eqiad.php: Give more weight to db1055

https://gerrit.wikimedia.org/r/374777

Mentioned in SAL (#wikimedia-operations) [2017-08-30T09:49:21Z] <marostegui@tin> Synchronized wmf-config/db-eqiad.php: Give db1055 more traffic - T174265 (duration: 00m 47s)

Change 374783 had a related patch set uploaded (by Marostegui; owner: Marostegui):
[operations/mediawiki-config@master] db-eqiad.php: Give db1055 more traffic

https://gerrit.wikimedia.org/r/374783

Change 374783 merged by jenkins-bot:
[operations/mediawiki-config@master] db-eqiad.php: Give db1055 more traffic

https://gerrit.wikimedia.org/r/374783

Mentioned in SAL (#wikimedia-operations) [2017-08-30T10:24:42Z] <marostegui@tin> Synchronized wmf-config/db-eqiad.php: Give db1055 more traffic - T174265 (duration: 00m 47s)

Change 374787 had a related patch set uploaded (by Marostegui; owner: Marostegui):
[operations/mediawiki-config@master] db-eqiad.php: Repool db1055 with full weight

https://gerrit.wikimedia.org/r/374787

Change 374787 merged by jenkins-bot:
[operations/mediawiki-config@master] db-eqiad.php: Repool db1055 with full weight

https://gerrit.wikimedia.org/r/374787

Mentioned in SAL (#wikimedia-operations) [2017-08-30T11:06:10Z] <marostegui@tin> Synchronized wmf-config/db-eqiad.php: Restore db1055 original weight - T174265 (duration: 00m 46s)

Marostegui closed this task as Resolved. Aug 30 2017, 11:06 AM
Marostegui assigned this task to Cmjohnson.

The original weight values have now been restored.
I will close this for now.

Thanks @Cmjohnson for helping out so fast!