Replace BBU for db1060
Closed, Resolved · Public

Description

Hello!

db1060 has a failed BBU, which made the RAID controller fall back to WriteThrough mode and caused the slave to lag.

root@db1060:~# megacli -LDInfo -L0 -a0 | grep "Current Cache Policy:"
Current Cache Policy: WriteThrough, ReadAheadNone, Direct, No Write Cache if Bad BBU
root@db1060:~# megacli -AdpBbuCmd -aAll

BBU status for Adapter: 0

BatteryType: BBU
Voltage: 3497 mV
Current: 0 mA
Temperature: 45 C
Battery State: Degraded(Need Attention)
		A manual learn is required.
BBU Firmware Status:

  Charging Status              : None
BBU GasGauge Status: 0x0128
Relative State of Charge: 18 %
Charger Status: Unknown

Can we get another one?
I have forced the write policy to WriteBack to help the server with the replication lag:

root@db1060:~#  megacli -LDSetProp -ForcedWB -Immediate -Lall -aAll

Set Write Policy to Forced WriteBack on Adapter 0, VD 0 (target id: 0) success
root@db1060:~#   megacli -LDInfo -L0 -a0 | grep "Current Cache Policy:"
Current Cache Policy: WriteBack, ReadAheadNone, Direct, Write Cache OK if Bad BBU
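
The cache policy lines above can also be checked programmatically. The following is a minimal Python sketch, assuming the `Current Cache Policy:` line always carries the four comma-separated fields seen in this task. The names `CachePolicy`, `parse_cache_policy` and `write_cache_forced` are hypothetical helpers for illustration, not part of megacli.

```python
from typing import NamedTuple


class CachePolicy(NamedTuple):
    write_policy: str    # e.g. "WriteBack" or "WriteThrough"
    read_policy: str     # e.g. "ReadAheadNone"
    io_policy: str       # e.g. "Direct"
    bad_bbu_policy: str  # e.g. "No Write Cache if Bad BBU"


def parse_cache_policy(line: str) -> CachePolicy:
    """Split a '... Cache Policy: a, b, c, d' line into its four fields."""
    _, _, value = line.partition(":")
    fields = [field.strip() for field in value.split(",")]
    if len(fields) != 4:
        # Assumption: four fields, as in the output pasted in this task.
        raise ValueError(f"unexpected cache policy line: {line!r}")
    return CachePolicy(*fields)


def write_cache_forced(policy: CachePolicy) -> bool:
    """True when the write cache stays enabled even with a bad BBU."""
    return policy.bad_bbu_policy == "Write Cache OK if Bad BBU"


# The two lines pasted in the description, before and after -ForcedWB.
before = parse_cache_policy(
    "Current Cache Policy: WriteThrough, ReadAheadNone, Direct, "
    "No Write Cache if Bad BBU"
)
after = parse_cache_policy(
    "Current Cache Policy: WriteBack, ReadAheadNone, Direct, "
    "Write Cache OK if Bad BBU"
)
```

A check like `write_cache_forced(after)` could flag hosts that are running with the write cache enabled and no working battery, which is the state this task should not leave db1060 in for long.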
Restricted Application added a subscriber: Aklapper. Feb 15 2017, 3:00 PM
Marostegui triaged this task as High priority. Feb 15 2017, 3:00 PM
jcrespo renamed this task from Replaced BBU for db1060 to Replace BBU for db1060. Feb 15 2017, 3:00 PM

Mentioned in SAL (#wikimedia-operations) [2017-02-15T15:57:23Z] <marostegui> (Old action but for the sake of getting it logged) Force RAID controller to work on WriteBack even with the broken BBU it has now on db1060 so it can keep up with the replication thread - T158194

@Cmjohnson were you able to find a replacement BBU in the end?
Thanks!

Marostegui moved this task from Triage to In progress on the DBA board. Feb 16 2017, 10:03 AM

@Cmjohnson sorry to push, but were you able to see if there's a replacement BBU? I wouldn't like to leave the server with WriteBack forced without the BBU as we might lose data if there is a power issue.
Thanks and sorry again for pushing on this.

Change 339200 had a related patch set uploaded (by Marostegui):
db-eqiad.php: Depool db1060

https://gerrit.wikimedia.org/r/339200

Change 339200 merged by jenkins-bot:
db-eqiad.php: Depool db1060

https://gerrit.wikimedia.org/r/339200

Mentioned in SAL (#wikimedia-operations) [2017-02-22T16:20:50Z] <marostegui@tin> Synchronized wmf-config/db-eqiad.php: Depool db1060 - T158194 (duration: 00m 40s)

Mentioned in SAL (#wikimedia-operations) [2017-02-22T16:21:51Z] <marostegui> Shutdown db1060 for BBU replacement - T158194

Change 339226 had a related patch set uploaded (by Marostegui):
db-eqiad.php: Repool db1060 with low weight

https://gerrit.wikimedia.org/r/339226

Thanks @Cmjohnson - the BBU now looks good!

root@db1060:~#  megacli -AdpBbuCmd -aAll

BBU status for Adapter: 0

BatteryType: BBU
Voltage: 3937 mV
Current: 468 mA
Temperature: 31 C
Battery State: Optimal
BBU Firmware Status:

  Charging Status              : Charging

BBU Capacity Info for Adapter: 0

  Relative State of Charge: 54 %
  Absolute State of charge: 0 %
  Remaining Capacity: 218 mAh
  Full Charge Capacity: 409 mAh
  Run time to empty: Battery is not being charged.
  Average time to empty: 17 Min.
  Estimated Time to full recharge: 47 Min.

And after a while, it is charged:

Relative State of Charge: 100 %
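
Waiting for the battery to report healthy could be scripted rather than checked by hand. Below is a rough Python sketch that parses `megacli -AdpBbuCmd` text output, assuming the `Key: value` layout shown in this task. The helper names `parse_bbu_status` and `bbu_is_healthy`, and the `min_charge` threshold, are hypothetical, and the sample strings are excerpts from the outputs pasted above.

```python
import re


def parse_bbu_status(output: str) -> dict:
    """Collect 'Key: value' pairs, skipping section headers with no value."""
    status = {}
    for raw in output.splitlines():
        key, sep, value = raw.partition(":")
        if sep and value.strip():
            status[key.strip()] = value.strip()
    return status


def bbu_is_healthy(status: dict, min_charge: int = 100) -> bool:
    """True when the battery is Optimal and charged to at least min_charge %."""
    state = status.get("Battery State", "")
    match = re.match(r"(\d+)\s*%", status.get("Relative State of Charge", ""))
    charge = int(match.group(1)) if match else 0
    return state == "Optimal" and charge >= min_charge


# Excerpts from the outputs in this task (failed BBU, then the replacement).
FAILED = """\
Battery State: Degraded(Need Attention)
BBU Firmware Status:
  Charging Status              : None
Relative State of Charge: 18 %
"""

REPLACED = """\
Battery State: Optimal
BBU Firmware Status:
  Charging Status              : Charging
  Relative State of Charge: 54 %
"""

failed = parse_bbu_status(FAILED)
replaced = parse_bbu_status(REPLACED)
```

With the default threshold, `bbu_is_healthy(replaced)` stays false until the charge reaches 100 %, so the same function could be polled while the new battery charges.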

I have set the default policy back so that the controller falls back to WriteThrough if the BBU fails:

 megacli -LDSetProp  NoCachedBadBBU -L0 -a0

Set No Write Cache if bad BBU on Adapter 0, VD 0 (target id: 0) success

After the BBU recharged, the current policy went back to WriteBack:

root@db1060:~# megacli -LDInfo -LAll -aAll | grep "Cache Policy:"
Default Cache Policy: WriteBack, ReadAheadNone, Direct, No Write Cache if Bad BBU
Current Cache Policy: WriteBack, ReadAheadNone, Direct, No Write Cache if Bad BBU
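
Comparing the Default and Current lines is a quick way to spot a controller that has silently dropped to WriteThrough. A minimal Python sketch follows, assuming a single virtual drive (as on db1060) and the grep output format shown above. The function names are hypothetical, and the `DEGRADED` sample is constructed for illustration from lines in this task rather than copied from a real run.

```python
def policies_from_ldinfo(output: str) -> dict:
    """Map 'Default' / 'Current' to the write policy in grep'd -LDInfo output."""
    policies = {}
    for line in output.splitlines():
        label, sep, value = line.partition(" Cache Policy:")
        if sep:
            # The write policy is the first comma-separated field.
            policies[label.strip()] = value.split(",")[0].strip()
    return policies


def controller_fell_back(output: str) -> bool:
    """True when the default is WriteBack but the controller runs WriteThrough."""
    policies = policies_from_ldinfo(output)
    return (
        policies.get("Default") == "WriteBack"
        and policies.get("Current") == "WriteThrough"
    )


# Output pasted above, after the BBU replacement.
HEALTHY = """\
Default Cache Policy: WriteBack, ReadAheadNone, Direct, No Write Cache if Bad BBU
Current Cache Policy: WriteBack, ReadAheadNone, Direct, No Write Cache if Bad BBU
"""

# Constructed example (assumption): what a bad BBU would presumably look like.
DEGRADED = """\
Default Cache Policy: WriteBack, ReadAheadNone, Direct, No Write Cache if Bad BBU
Current Cache Policy: WriteThrough, ReadAheadNone, Direct, No Write Cache if Bad BBU
"""
```

With more than one virtual drive, later lines would overwrite earlier ones in the dictionary, so the sketch would need to track each drive separately.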
Marostegui closed this task as Resolved. Feb 22 2017, 6:03 PM

Change 339226 merged by jenkins-bot:
db-eqiad.php: Repool db1060 with low weight

https://gerrit.wikimedia.org/r/339226

Repooled db1060 with a lower weight (and still not serving API traffic) so it can warm up a bit.

Mentioned in SAL (#wikimedia-operations) [2017-02-22T18:47:19Z] <marostegui@tin> Synchronized wmf-config/db-eqiad.php: Repool db1060 with less weight - T158194 (duration: 00m 39s)

Change 339349 had a related patch set uploaded (by Marostegui):
db-eqiad.php: Restore db1060 original load

https://gerrit.wikimedia.org/r/339349

Change 339349 merged by jenkins-bot:
db-eqiad.php: Restore db1060 original load

https://gerrit.wikimedia.org/r/339349

Mentioned in SAL (#wikimedia-operations) [2017-02-23T07:06:29Z] <marostegui@tin> Synchronized wmf-config/db-eqiad.php: Restore db1060 original load - T158194 (duration: 00m 40s)