
Possibly BBU issues on db1067
Closed, Resolved (Public)

Description

db1067 is showing really high temperature on the BBU

root@db1067:~#  megacli -AdpBbuCmd  -a0

BBU status for Adapter: 0

BatteryType: BBU
Voltage: 3957 mV
Current: 0 mA
Temperature: 76 C
Battery State: Optimal
BBU Firmware Status:

  Charging Status              : None
  Voltage                                 : OK
  Temperature                             : High
  Learn Cycle Requested	                  : No
  Learn Cycle Active                      : No
  Learn Cycle Status                      : OK
  Learn Cycle Timeout                     : No
  I2c Errors Detected                     : No
  Battery Pack Missing                    : No
  Battery Replacement required            : No
  Remaining Capacity Low                  : No
  Periodic Learn Required                 : No
  Transparent Learn                       : No
  No space to cache offload               : No
  Pack is about to fail & should be replaced : No
  Cache Offload premium feature required  : No
  Module microcode update required        : No

BBU GasGauge Status: 0x0238
Relative State of Charge: 100 %
Charger Status: Complete
Remaining Capacity: 542 mAh
Full Charge Capacity: 542 mAh
isSOHGood: Yes
  Battery backup charge time : 0 hours

BBU Capacity Info for Adapter: 0

  Relative State of Charge: 100 %
  Absolute State of charge: 0 %
  Remaining Capacity: 542 mAh
  Full Charge Capacity: 542 mAh
  Run time to empty: Battery is not being charged.
  Average time to empty: 43 Min.
  Estimated Time to full recharge: Battery is not being charged.
  Cycle Count: 1
Max Error = 0 %
Remaining Capacity Alarm = 0 mAh
Remining Time Alarm = 0 Min

BBU Design Info for Adapter: 0

  Date of Manufacture: 07/18, 2011
  Design Capacity: 90 mAh
  Design Voltage: 0 mV
  Specification Info: 0
  Serial Number: 0
  Pack Stat Configuration: 0x0000
  Manufacture Name:
  Firmware Version   : 0148 03
  Device Name:
  Device Chemistry:
  Battery FRU: N/A
Module Version = 0148 03
  Transparent Learn = 1
  App Data = 0

BBU Properties for Adapter: 0

  Auto Learn Period: 90 Days
  Next Learn time: None  Learn Delay Interval:0 Hours
  Auto-Learn Mode: Disabled

Exit Code: 0x00
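
For repeated checks, the two temperature lines can be pulled out of this output with a small filter. This is only a minimal sketch, assuming the output format shown above; the function name `bbu_temp_check` and the 60 C default limit are hypothetical and not vendor values.

```shell
# Sketch: summarize BBU temperature from `megacli -AdpBbuCmd -a0` output on stdin.
# The first argument is a hypothetical limit in degrees C (default 60).
# Exit status: 0 = fine, 1 = over the limit or flag not "OK", 2 = no reading found.
bbu_temp_check() {
    limit_c="${1:-60}"
    awk -v limit="$limit_c" '
        # Unindented line, e.g. "Temperature: 76 C"
        /^Temperature:/ { temp = $2 }
        # Indented firmware status line, e.g. "  Temperature   : High"
        /^[ \t]+Temperature[ \t]*:/ { flag = $NF }
        END {
            if (temp == "") { print "no temperature found"; exit 2 }
            printf "temp=%sC flag=%s\n", temp, flag
            if (temp + 0 > limit + 0 || flag != "OK") exit 1
        }'
}
```

Usage would be along the lines of `megacli -AdpBbuCmd -a0 | bbu_temp_check 60`, with the exit status usable from a cron job or a monitoring check.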

The RAID cache policy is WriteThrough at the moment, even though the BBU status shows nothing wrong apart from the temperature.

root@db1067:~# megacli -ldinfo -l0 -a0 | grep Policy
Default Cache Policy: WriteBack, ReadAheadNone, Direct, No Write Cache if Bad BBU
Current Cache Policy: WriteThrough, ReadAheadNone, Direct, No Write Cache if Bad BBU
Default Access Policy: Read/Write
Current Access Policy: Read/Write
Disk Cache Policy   : Disk's Default
Default Power Savings Policy: Controller Defined
Current Power Savings Policy: None
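
The fallback from WriteBack to WriteThrough shows up as a mismatch between the Default and Current cache policy lines. A rough sketch of a check for that mismatch, assuming the `-ldinfo` line format shown above; `cache_policy_drift` is a made-up helper name.

```shell
# Sketch: compare "Default Cache Policy" and "Current Cache Policy" from
# `megacli -ldinfo -l0 -a0` output on stdin.
# Exit status: 0 = policies match, 1 = they differ, 2 = lines not found.
cache_policy_drift() {
    awk -F': ' '
        /^Default Cache Policy:/ { def = $2 }
        /^Current Cache Policy:/ { cur = $2 }
        END {
            if (def == "" || cur == "") { print "cache policy lines not found"; exit 2 }
            if (def != cur) { print "drift: default=[" def "] current=[" cur "]"; exit 1 }
            print "ok: " cur
        }'
}
```

With the output above this reports drift, since the default is WriteBack and the current policy is WriteThrough.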

Given that this is the s1 candidate master, it is probably better to just replace the BBU to be on the safe side.

I have been talking to @Cmjohnson and he is going to check whether we have spare BBUs.

Event Timeline

Restricted Application added a project: Operations. May 16 2018, 8:08 PM
Restricted Application added a subscriber: Aklapper.
Marostegui triaged this task as High priority. May 16 2018, 8:08 PM
Marostegui moved this task from Triage to In progress on the DBA board.

The BBU is definitely having issues; I cannot even force a relearn:

root@db1067:~#  megacli -AdpBbuCmd -BbuLearn -aALL -NoLog

Adapter 0: BBU Learn Failed

Exit Code: 0x01

Mentioned in SAL (#wikimedia-operations) [2018-05-16T20:13:38Z] <marostegui> Force WriteBack policy on db1067 - T194852

I have manually set the policy to WriteBack so at least the server can catch up and not lag forever:

root@db1067:~# megacli -LDSetProp -ForcedWB -Immediate -Lall -aAll

Set Write Policy to Forced WriteBack on Adapter 0, VD 0 (target id: 0) success

Exit Code: 0x00
root@db1067:~# megacli -ldinfo -l0 -a0 | grep Policy
Default Cache Policy: WriteBack, ReadAheadNone, Direct, Write Cache OK if Bad BBU
Current Cache Policy: WriteBack, ReadAheadNone, Direct, Write Cache OK if Bad BBU

These are the BBU event logs after the reboot for the rack change:

seqNum: 0x0002e80d
Time: Wed May 16 16:35:05 2018

Code: 0x00000093
Class: 0
Locale: 0x08
Event Description: Battery started charging
Event Data:
===========
None


seqNum: 0x0002e80f
Time: Wed May 16 16:50:15 2018

Code: 0x000000f2
Class: 0
Locale: 0x08
Event Description: Battery charge complete
Event Data:
===========
None


seqNum: 0x0002e810
Time: Wed May 16 17:10:50 2018

Code: 0x00000091
Class: 1
Locale: 0x08
Event Description: Battery temperature is high
Event Data:
===========
None


seqNum: 0x0002e811
Time: Wed May 16 17:10:51 2018

Code: 0x000000c3
Class: 1
Locale: 0x08
Event Description: BBU disabled; changing WB virtual disks to WT, Forced WB VDs are not affected
Event Data:
===========

It definitely needs a replacement.

Mentioned in SAL (#wikimedia-operations) [2018-05-17T11:08:22Z] <marostegui> Stop MySQL and poweroff db1067 - T194852

For the record, after having the server powered off for 1 hour:

Initial BBU temperature right after power-up: Temperature: 42 C

For comparison, the system thermal zones:

root@db1067:~$ cat /sys/class/thermal/thermal_zone*/temp
54000
45000
root@db1067:~$ date
Thu May 17 12:46:16 UTC 2018
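
The sysfs thermal zone values are reported in millidegrees Celsius, so the readings above are 54 C and 45 C. A tiny sketch of the conversion; `millideg_to_c` is a hypothetical helper name, and it truncates to whole degrees.

```shell
# Sketch: convert sysfs thermal readings (millidegrees Celsius, one value
# per line on stdin) to whole degrees Celsius.
millideg_to_c() {
    while read -r milli; do
        echo "$((milli / 1000))"
    done
}
```

For example: `cat /sys/class/thermal/thermal_zone*/temp | millideg_to_c`.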

After 10 minutes, the temperature reached 45 C.

I have started MySQL now.

I have set the default policy back to WriteBack, falling back to WriteThrough if the BBU is missing or bad, so the host is configured as it was before the issues.

root@db1067:~# megacli -ldinfo -l0 -a0 | grep Policy
Default Cache Policy: WriteBack, ReadAheadNone, Direct, No Write Cache if Bad BBU
Current Cache Policy: WriteBack, ReadAheadNone, Direct, No Write Cache if Bad BBU

This is still working fine. Maybe it was a one-time thing?

root@db1067:~#  megacli -AdpBbuCmd -a0  | grep Temper
Temperature: 48 C
  Temperature                             : OK

Let's wait until tomorrow, and do another reboot.

Looks like it was a one-time thing:

root@db1067:~#  megacli -AdpBbuCmd -a0  | grep Temper
Temperature: 47 C
  Temperature                             : OK

I am going to reboot it again and see if it stays fine after the reboot.

Mentioned in SAL (#wikimedia-operations) [2018-05-18T05:18:16Z] <marostegui> Stop MySQL and reboot db1067 - T194852

After reboot:

root@db1067:~#  megacli -AdpBbuCmd -a0  | grep Temper
Temperature: 48 C
  Temperature                             : OK

Change 433692 had a related patch set uploaded (by Marostegui; owner: Marostegui):
[operations/puppet@production] db1067.yaml: Disable notifications

https://gerrit.wikimedia.org/r/433692

Change 433692 merged by Marostegui:
[operations/puppet@production] db1067.yaml: Disable notifications

https://gerrit.wikimedia.org/r/433692

Still looking good after 10 hours:

root@db1067:~#  megacli -AdpBbuCmd -a0  | grep Temper
Temperature: 47 C
  Temperature                             : OK

I am going to reboot it again and leave it for the weekend to see how it goes.

Mentioned in SAL (#wikimedia-operations) [2018-05-18T15:44:46Z] <marostegui> Stop MySQL and reboot db1067 - T194852

For the record, after the reboot:

root@db1067:~#  megacli -AdpBbuCmd -a0  | grep Temper
Temperature: 48 C
  Temperature

After the weekend, everything looks fine:

root@db1067:~#  megacli -AdpBbuCmd -a0  | grep Temper
Temperature: 47 C
  Temperature                             : OK

Going to give it one more reboot and see what happens. If it goes fine, I will repool it.

Mentioned in SAL (#wikimedia-operations) [2018-05-21T05:27:08Z] <marostegui> Stop MySQL and reboot db1067 - T194852

After the reboot:

root@db1067:~#  megacli -AdpBbuCmd -a0  | grep Temper
Temperature: 48 C
  Temperature                             : OK

Change 434299 had a related patch set uploaded (by Marostegui; owner: Marostegui):
[operations/puppet@production] db1067.yaml: Enable notifications on db1067

https://gerrit.wikimedia.org/r/434299

Change 434299 merged by Marostegui:
[operations/puppet@production] db1067.yaml: Enable notifications on db1067

https://gerrit.wikimedia.org/r/434299

Still looking fine:

root@db1067:~#  megacli -AdpBbuCmd -a0  | grep Temper
Temperature: 47 C
  Temperature                             : OK

If by tomorrow morning (EU time) it is still good, I will repool it.

root@db1067:~#  megacli -AdpBbuCmd -a0  | grep Temper
Temperature: 47 C
  Temperature                             : OK

Change 434435 had a related patch set uploaded (by Marostegui; owner: Marostegui):
[operations/mediawiki-config@master] db-eqiad.php: Repool db1067

https://gerrit.wikimedia.org/r/434435

Change 434435 merged by jenkins-bot:
[operations/mediawiki-config@master] db-eqiad.php: Repool db1067

https://gerrit.wikimedia.org/r/434435

Marostegui closed this task as Resolved. May 22 2018, 5:25 AM

I have repooled this host.
It didn't have any issues after many days and many reboots, so it was probably a one-time thing.
Resolving for now.

Vvjjkkii renamed this task from Possibly BBU issues on db1067 to rucaaaaaaa. Jul 1 2018, 1:10 AM
Vvjjkkii reopened this task as Open.
Vvjjkkii removed Cmjohnson as the assignee of this task.
Vvjjkkii updated the task description.
Vvjjkkii removed subscribers: gerritbot, Aklapper.
Marostegui renamed this task from rucaaaaaaa to Possibly BBU issues on db1067. Jul 1 2018, 6:42 PM
Marostegui closed this task as Resolved.
Marostegui assigned this task to Cmjohnson.
Marostegui lowered the priority of this task from High to Medium.
Marostegui updated the task description.