
x1 master db1031: Faulty BBU
Closed, Duplicate (Public)

Description

Hello,

db1031 seems to have a broken BBU.
It first showed up as a sharp increase in disk utilization: https://grafana.wikimedia.org/dashboard/file/server-board.json?refresh=1m&orgId=1&var-server=db1031&var-network=eth0&from=1495512674378&to=1495527469190&panelId=19&fullscreen

root@db1031:~#  megacli -AdpBbuCmd  -a0

BBU status for Adapter: 0

BatteryType: BBU
Voltage: 4021 mV
Current: 0 mA
Temperature: 32 C
Battery State: Degraded(Need Attention)
		A manual learn is required.
BBU Firmware Status:

  Charging Status              : None
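
The degraded state reported above can also be read programmatically. This is a minimal sketch, assuming the `megacli -AdpBbuCmd` output format shown in this task. The function names `parse_battery_state` and `bbu_is_healthy` are hypothetical and not part of any existing tool, and treating only "Optimal" as healthy is an assumption.

```python
import re

# Sample copied from the `megacli -AdpBbuCmd -a0` output in this task.
SAMPLE = """\
BBU status for Adapter: 0

BatteryType: BBU
Voltage: 4021 mV
Current: 0 mA
Temperature: 32 C
Battery State: Degraded(Need Attention)
"""


def parse_battery_state(output):
    """Return the text after 'Battery State:', or None if the field is absent."""
    match = re.search(r"^Battery State\s*:\s*(.+?)\s*$", output, re.MULTILINE)
    return match.group(1) if match else None


def bbu_is_healthy(output):
    """Assumed rule: only the state 'Optimal' counts as healthy."""
    return parse_battery_state(output) == "Optimal"


print(parse_battery_state(SAMPLE))
print(bbu_is_healthy(SAMPLE))
```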

As a result, the current cache policy has fallen back to WriteThrough:

root@db1031:~# megacli -LDInfo -Lall -aALL


Adapter 0 -- Virtual Drive Information:
Virtual Drive: 0 (Target Id: 0)
Name                :
RAID Level          : Primary-1, Secondary-0, RAID Level Qualifier-0
Size                : 1.633 TB
Sector Size         : 512
Mirror Data         : 1.633 TB
State               : Optimal
Strip Size          : 256 KB
Number Of Drives per span:2
Span Depth          : 6
Default Cache Policy: WriteBack, ReadAdaptive, Direct, No Write Cache if Bad BBU
Current Cache Policy: WriteThrough, ReadAdaptive, Direct, No Write Cache if Bad BBU
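
The mismatch between the default and current policy is the signal of interest here. The following is a rough sketch of how it could be detected from the `megacli -LDInfo` output, assuming the line format shown above and a single virtual drive. The helper names `cache_policies` and `write_policy_degraded` are illustrative only.

```python
# Sample lines copied from the `megacli -LDInfo -Lall -aALL` output above.
SAMPLE = """\
Virtual Drive: 0 (Target Id: 0)
State               : Optimal
Default Cache Policy: WriteBack, ReadAdaptive, Direct, No Write Cache if Bad BBU
Current Cache Policy: WriteThrough, ReadAdaptive, Direct, No Write Cache if Bad BBU
"""


def cache_policies(output):
    """Map 'Default'/'Current' to the write policy (first comma-separated field).

    Assumes one virtual drive; later drives would overwrite earlier ones.
    """
    policies = {}
    for line in output.splitlines():
        key, sep, value = line.partition(":")
        key = key.strip()
        if sep and key in ("Default Cache Policy", "Current Cache Policy"):
            policies[key.split()[0]] = value.split(",")[0].strip()
    return policies


def write_policy_degraded(output):
    """True when the controller is not running its configured write policy."""
    policies = cache_policies(output)
    return policies.get("Default") != policies.get("Current")


print(cache_policies(SAMPLE))
print(write_policy_degraded(SAMPLE))
```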

I first forced a learn cycle to see if that would help the BBU recover, as it sometimes did on db1048 (T160731#3109104):

root@db1031:~# megacli -AdpBbuCmd -BbuLearn -aALL -NoLog

Adapter 0: BBU Learn Succeeded.

Exit Code: 0x00

But after a while nothing had changed:

BBU status for Adapter: 0

BatteryType: BBU
Voltage: 4021 mV
Current: 0 mA
Temperature: 32 C
Battery State: Degraded(Need Attention)
		A manual learn is required.
BBU Firmware Status:

  Charging Status              : None
  Voltage                                 : OK
  Temperature                             : OK
  Learn Cycle Requested	                  : Yes

I disabled the automatic learn cycle just in case, before forcing the policy to WriteBack, to avoid any issues if the BBU misbehaves again:

root@db1031:~#  megacli -AdpBbuCmd -a0 | grep Auto-Learn
  Auto-Learn Mode: Warn via Event
root@db1031:~#  echo "autoLearnMode=1" > disable_learn
root@db1031:~# megacli -AdpBbuCmd -SetBbuProperties -f disable_learn -a0

Adapter 0: Set BBU Properties Succeeded.

Exit Code: 0x00

Then I forced WriteBack (WB):

root@db1031:~#  megacli -LDSetProp -ForcedWB -Immediate -Lall -aAll

Set Write Policy to Forced WriteBack on Adapter 0, VD 0 (target id: 0) success

Exit Code: 0x00
Default Cache Policy: WriteBack, ReadAdaptive, Direct, No Write Cache if Bad BBU
Current Cache Policy: WriteBack, ReadAdaptive, Direct, No Write Cache if Bad BBU

Once it was set to WB, disk I/O dropped, as can be seen on: https://grafana.wikimedia.org/dashboard/file/server-board.json?refresh=1m&orgId=1&var-server=db1031&var-network=eth0&from=1495511890227&to=1495527409190

Event Timeline


After a long while the BBU shows Optimal again, so it looks like the manual relearn worked (the same way it did on db1048 - T160731#3109104).
Setting the policy back to its default now shows Current: WriteBack, the same thing db1048 normally did:

root@db1031:~# megacli -LDSetProp  NoCachedBadBBU -L0 -a0

Set No Write Cache if bad BBU on Adapter 0, VD 0 (target id: 0) success

Exit Code: 0x00
root@db1031:~# megacli -LDInfo -Lall -aALL


Adapter 0 -- Virtual Drive Information:
Virtual Drive: 0 (Target Id: 0)
Name                :
RAID Level          : Primary-1, Secondary-0, RAID Level Qualifier-0
Size                : 1.633 TB
Sector Size         : 512
Mirror Data         : 1.633 TB
State               : Optimal
Strip Size          : 256 KB
Number Of Drives per span:2
Span Depth          : 6
Default Cache Policy: WriteBack, ReadAdaptive, Direct, No Write Cache if Bad BBU
Current Cache Policy: WriteBack, ReadAdaptive, Direct, No Write Cache if Bad BBU
Default Access Policy: Read/Write
Current Access Policy: Read/Write
Disk Cache Policy   : Disk's Default
Encryption Type     : None
Bad Blocks Exist: No

Let's wait and see whether this happens again. On db1048 it was fine for some weeks and then the problem came back, several times, until we replaced the BBU.

Change 355190 had a related patch set uploaded (by Jcrespo; owner: Jcrespo):
[operations/puppet@production] raid-check: Return critical in WriteThough mode for megacli

https://gerrit.wikimedia.org/r/355190

And this happened again, see https://grafana.wikimedia.org/dashboard/file/server-board.json?refresh=1m&orgId=1&var-server=db1031&var-network=eth0&from=1495532268716&to=1495534072869&panelId=19&fullscreen

root@db1031:~# megacli -AdpBbuCmd  -a0

BBU status for Adapter: 0

BatteryType: BBU
Voltage: 3830 mV
Current: -685 mA
Temperature: 33 C
Battery State: Learning
BBU Firmware Status:

  Charging Status              : Discharging
  Voltage                                 : OK
  Temperature                             : OK
  Learn Cycle Requested	                  : Yes
  Learn Cycle Active                      : Yes
  Learn Cycle Status                      : OK
  Learn Cycle Timeout                     : No
  I2c Errors Detected                     : No
  Battery Pack Missing                    : No
  Battery Replacement required            : No
  Remaining Capacity Low                  : Yes
  Periodic Learn Required                 : No
  Transparent Learn                       : No
  No space to cache offload               : No
  Pack is about to fail & should be replaced : No
  Cache Offload premium feature required  : No
  Module microcode update required        : No


GasGuageStatus:
  Fully Discharged        : No
  Fully Charged           : No
  Discharging             : Yes
  Initialized             : Yes
  Remaining Time Alarm    : Yes
  Discharge Terminated    : No
  Over Temperature        : No
  Charging Terminated     : No
  Over Charged            : No
Relative State of Charge: 9 %
Charger Status: Off
Remaining Capacity: 21 mAh
Full Charge Capacity: 245 mAh
isSOHGood: Yes
  Battery backup charge time : 0 hours

BBU Capacity Info for Adapter: 0

  Relative State of Charge: 9 %
  Absolute State of charge: 1 %
  Remaining Capacity: 21 mAh
  Full Charge Capacity: 245 mAh
  Run time to empty: 1 Min.
  Average time to empty: 1 Min.
  Estimated Time to full recharge: Battery is not being charged.
  Cycle Count: 32
Max Error = 6 %
Remaining Capacity Alarm = 170 mAh
Remining Time Alarm = 10 Min

BBU Design Info for Adapter: 0

  Date of Manufacture: 11/17, 2010
  Design Capacity: 1700 mAh
  Design Voltage: 3700 mV
  Specification Info: 33
  Serial Number: 5467
  Pack Stat Configuration: 0x0008
  Manufacture Name: SANYO
  Firmware Version   :
  Device Name: DLNU209
  Device Chemistry: LION
  Battery FRU: N/A
  Transparent Learn = 0
  App Data = 0

BBU Properties for Adapter: 0

  Auto Learn Period: 90 Days
  Next Learn time: None  Learn Delay Interval:0 Hours
  Auto-Learn Mode: Disabled

Exit Code: 0x00
Default Cache Policy: WriteBack, ReadAdaptive, Direct, No Write Cache if Bad BBU
Current Cache Policy: WriteThrough, ReadAdaptive, Direct, No Write Cache if Bad BBU
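
The capacity figures in that output can be cross-checked with simple arithmetic. This is a small sketch using only numbers copied from the output above. It assumes that "relative" state of charge means remaining over full-charge capacity and "absolute" means remaining over design capacity, which is consistent with the reported 9 % and 1 %. The "health" ratio is an illustrative figure, not something megacli reports.

```python
# Values copied from the BBU output above (all in mAh).
remaining_capacity = 21
full_charge_capacity = 245
design_capacity = 1700

# Assumed definitions, consistent with the reported 9 % and 1 %.
relative_soc = 100 * remaining_capacity / full_charge_capacity
absolute_soc = 100 * remaining_capacity / design_capacity

# Illustrative ratio: how much of the design capacity the pack can still hold.
health = 100 * full_charge_capacity / design_capacity

print(f"relative state of charge: {relative_soc:.1f} %")  # 8.6 %
print(f"absolute state of charge: {absolute_soc:.1f} %")  # 1.2 %
print(f"full charge vs design:    {health:.1f} %")        # 14.4 %
```

Under these assumptions the pack holds roughly a seventh of its design capacity even when fully charged.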

So I forced it to WB:

root@db1031:~# megacli -LDSetProp -ForcedWB -Immediate -Lall -aAll

Set Write Policy to Forced WriteBack on Adapter 0, VD 0 (target id: 0) success

Exit Code: 0x00
root@db1031:~# megacli -LDInfo -Lall -aALL


Adapter 0 -- Virtual Drive Information:
Virtual Drive: 0 (Target Id: 0)
Name                :
RAID Level          : Primary-1, Secondary-0, RAID Level Qualifier-0
Size                : 1.633 TB
Sector Size         : 512
Mirror Data         : 1.633 TB
State               : Optimal
Strip Size          : 256 KB
Number Of Drives per span:2
Span Depth          : 6
Default Cache Policy: WriteBack, ReadAdaptive, Direct, Write Cache OK if Bad BBU
Current Cache Policy: WriteBack, ReadAdaptive, Direct, Write Cache OK if Bad BBU

Change 355190 merged by Jcrespo:
[operations/puppet@production] raid-check: Return critical when not in WriteBack mode for megacli

https://gerrit.wikimedia.org/r/355190

Mentioned in SAL (#wikimedia-operations) [2017-05-23T14:25:22Z] <jynus> deploying new check_raid monitoring write policy for megacli T166108

Change 355246 had a related patch set uploaded (by Jcrespo; owner: Jcrespo):
[operations/puppet@production] raid-check: optionally return critical when not in a write policy

https://gerrit.wikimedia.org/r/355246

The previous patch was reverted. I am creating a separate one that allows the extra check to be enabled or disabled at will (for megacli first).

root@prometheus1003:~$ python check-raid.py 
OK: optimal, 2 logical, 6 physical
OK
root@prometheus1003:~$ python check-raid.py --policy=WriteBack
CRITICAL: 2 LD(s) not in WriteBack policy (WriteThrough, WriteThrough)
root@prometheus1003:~$ python check-raid.py --policy=WriteThrough
OK: optimal, 2 logical, 6 physical, WriteThrough policy
OK
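
The opt-in behaviour shown in that transcript could be sketched roughly as follows. This is not the actual check-raid.py code: the function `check_policy`, its arguments and the wording of the OK messages are hypothetical. Only the CRITICAL message format is modelled on the output above, and the exit codes follow the usual Nagios plugin convention.

```python
OK, CRITICAL = 0, 2  # conventional Nagios plugin exit codes


def check_policy(current_policies, expected=None):
    """Return (exit_code, message) for a list of per-LD write policies.

    With expected=None the policy check is skipped, so the extra
    check stays opt-in.
    """
    if expected is None:
        return OK, "OK: policy not checked"
    bad = [policy for policy in current_policies if policy != expected]
    if bad:
        message = "CRITICAL: %d LD(s) not in %s policy (%s)" % (
            len(bad), expected, ", ".join(bad))
        return CRITICAL, message
    return OK, "OK: %s policy" % expected


# Two logical drives, both currently in WriteThrough, as on prometheus1003.
policies = ["WriteThrough", "WriteThrough"]
print(check_policy(policies))
print(check_policy(policies, expected="WriteBack"))
print(check_policy(policies, expected="WriteThrough"))
```

Keeping the expected policy optional means hosts that legitimately run in WriteThrough do not alert unless the check is explicitly enabled for them.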

Change 355246 merged by Jcrespo:
[operations/puppet@production] raid-check: optionally return critical when not in a write policy

https://gerrit.wikimedia.org/r/355246

Change 355249 had a related patch set uploaded (by Jcrespo; owner: Jcrespo):
[operations/puppet@production] [WIP]raid: Implement the option to check write cache policies

https://gerrit.wikimedia.org/r/355249

Change 355249 merged by Jcrespo:
[operations/puppet@production] raid: Implement the option to check write cache policies

https://gerrit.wikimedia.org/r/355249

Change 357994 had a related patch set uploaded (by Faidon Liambotis; owner: Faidon Liambotis):
[operations/puppet@production] Add a new raid::policy define

https://gerrit.wikimedia.org/r/357994

Change 357999 had a related patch set uploaded (by Faidon Liambotis; owner: Faidon Liambotis):
[operations/puppet@production] raid: add megacli default vs. current policy check

https://gerrit.wikimedia.org/r/357999

Marostegui triaged this task as Medium priority. (Jun 12 2017, 8:16 AM)
Marostegui changed the task status from Open to Stalled. (Jun 27 2017, 1:03 PM)
Marostegui moved this task from In progress to Blocked external/Not db team on the DBA board.

I have changed it to stalled as I don't think we are replacing its BBU anytime soon - it is a master.