
db1046 BBU looks faulty
Closed, Resolved, Public

Description

db1046 has a faulty BBU:

Auto learn is disabled

root@db1046:~#  megacli -AdpBbuCmd -a0 | grep Auto-Learn
  Auto-Learn Mode: Warn via Event

And it looks degraded:

root@db1046:~#  megacli -AdpBbuCmd  -a0

BBU status for Adapter: 0

BatteryType: BBU
Voltage: 4018 mV
Current: 0 mA
Temperature: 37 C
Battery State: Degraded(Need Attention)
		A manual learn is required.
BBU Firmware Status:

  Charging Status              : None
  Voltage                                 : OK
  Temperature                             : OK
  Learn Cycle Requested	                  : Yes
  Learn Cycle Active                      : No
  Learn Cycle Status                      : OK
  Learn Cycle Timeout                     : No
  I2c Errors Detected                     : No
  Battery Pack Missing                    : No
  Battery Replacement required            : No
  Remaining Capacity Low                  : Yes
  Periodic Learn Required                 : No
  Transparent Learn                       : No
  No space to cache offload               : No
  Pack is about to fail & should be replaced : No
  Cache Offload premium feature required  : No
  Module microcode update required        : No


GasGuageStatus:
  Fully Discharged        : No
  Fully Charged           : No
  Discharging             : Yes
  Initialized             : Yes
  Remaining Time Alarm    : No
  Discharge Terminated    : No
  Over Temperature        : No
  Charging Terminated     : No
  Over Charged            : No
Relative State of Charge: 9 %
Charger Status: Complete
Remaining Capacity: 18 mAh
Full Charge Capacity: 194 mAh
isSOHGood: Yes
  Battery backup charge time : 0 hours

BBU Capacity Info for Adapter: 0

  Relative State of Charge: 9 %
  Absolute State of charge: 1 %
  Remaining Capacity: 18 mAh
  Full Charge Capacity: 194 mAh
  Run time to empty: Battery is not being charged.
  Average time to empty: Battery is not being charged.
  Estimated Time to full recharge: Battery is not being charged.
  Cycle Count: 32
Max Error = 0 %
Remaining Capacity Alarm = 170 mAh
Remining Time Alarm = 10 Min

BBU Design Info for Adapter: 0

  Date of Manufacture: 11/17, 2010
  Design Capacity: 1700 mAh
  Design Voltage: 3700 mV
  Specification Info: 33
  Serial Number: 5154
  Pack Stat Configuration: 0x0000
  Manufacture Name: SANYO
  Firmware Version   :
  Device Name: DLNU209
  Device Chemistry: LION
  Battery FRU: N/A
  Transparent Learn = 0
  App Data = 0

BBU Properties for Adapter: 0

  Auto Learn Period: 90 Days
  Next Learn time: None  Learn Delay Interval:0 Hours
  Auto-Learn Mode: Warn via Event

Because of that the RAID policy went to WriteThrough (which can affect performance):

root@db1046:~#  megacli -LDInfo -Lall -aALL


Adapter 0 -- Virtual Drive Information:
Virtual Drive: 0 (Target Id: 0)
Name                :
RAID Level          : Primary-1, Secondary-0, RAID Level Qualifier-0
Size                : 1.633 TB
Sector Size         : 512
Mirror Data         : 1.633 TB
State               : Optimal
Strip Size          : 256 KB
Number Of Drives per span:2
Span Depth          : 6
Default Cache Policy: WriteBack, ReadAdaptive, Direct, No Write Cache if Bad BBU
Current Cache Policy: WriteThrough, ReadAdaptive, Direct, No Write Cache if Bad BBU
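The downgrade shown above (default WriteBack, current WriteThrough) can be detected mechanically. The following is only a minimal sketch, assuming the `megacli -LDInfo` text has already been captured as a string; the function names are hypothetical and this is not how the existing monitoring check is implemented.

```python
import re

def cache_policies(ldinfo_text):
    """Return (default_mode, current_mode) write policies from `megacli -LDInfo` text.

    Assumes a single virtual drive, as on db1046; with several drives the
    last one listed wins.
    """
    modes = {}
    for line in ldinfo_text.splitlines():
        m = re.match(r"\s*(Default|Current) Cache Policy:\s*(\w+)", line)
        if m:
            modes[m.group(1)] = m.group(2)
    return modes.get("Default"), modes.get("Current")

def write_cache_downgraded(ldinfo_text):
    """True when the controller fell back from WriteBack to WriteThrough."""
    default, current = cache_policies(ldinfo_text)
    return default == "WriteBack" and current == "WriteThrough"

# Two lines copied from the output above.
SAMPLE = """\
Default Cache Policy: WriteBack, ReadAdaptive, Direct, No Write Cache if Bad BBU
Current Cache Policy: WriteThrough, ReadAdaptive, Direct, No Write Cache if Bad BBU
"""
print(write_cache_downgraded(SAMPLE))  # True
```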

In other cases we have seen a manual relearn cycle fix the issue, although only temporarily: it comes back after a few hours, days, or weeks (see T160731#3109104).
So I forced a relearn:

root@db1046:~#  megacli -AdpBbuCmd -BbuLearn -aALL -NoLog

Adapter 0: BBU Learn Succeeded.

Exit Code: 0x00

It is now recharging the battery slowly, but it is still degraded:

Relative State of Charge: 19 %
Absolute State of charge: 2 %
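For repeated checks like this one, the `Key: Value` lines of the BBU output can be turned into a dictionary and tested. This is a rough sketch only: the helper names are hypothetical, and the 20 % threshold is an arbitrary assumption for illustration, not a value taken from the controller or from any existing check.

```python
def parse_bbu(text):
    """Parse 'Key: Value' lines from `megacli -AdpBbuCmd` output into a dict."""
    fields = {}
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        if sep and key.strip() and value.strip():
            # Keep the first occurrence; later sections repeat some keys.
            fields.setdefault(key.strip(), value.strip())
    return fields

def needs_attention(fields, min_charge_pct=20):
    """Flag a non-Optimal battery or a low charge (threshold is an assumption)."""
    state = fields.get("Battery State", "")
    charge = int(fields.get("Relative State of Charge", "0 %").split()[0])
    return not state.startswith("Optimal") or charge < min_charge_pct

# A few lines copied from the first status output in this task.
SAMPLE = """\
Battery State: Degraded(Need Attention)
  Learn Cycle Requested                   : Yes
  Remaining Capacity Low                  : Yes
Relative State of Charge: 9 %
"""
status = parse_bbu(SAMPLE)
print(status["Battery State"], needs_attention(status))
```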

We can always force the policy to go back to WriteBack if we see performance issues. The command would be:

megacli -LDSetProp -ForcedWB -Immediate -Lall -aAll

I have not executed it, as I would like @Ottomata and/or @elukey to review it first. Forcing the policy to WB is something we have done before and should not cause issues per se (see T166108).
This issue has probably been present for a long time, and if we have not seen any performance issues, it is of course safer to leave the policy on WriteThrough.

Event Timeline

Restricted Application added a subscriber: Aklapper. May 23 2017, 2:59 PM
Marostegui moved this task from Triage to In progress on the DBA board. May 24 2017, 8:22 AM
Nuria triaged this task as High priority. May 25 2017, 4:22 PM
Nuria edited projects, added Analytics-Kanban; removed Analytics.

Hello @Marostegui, thanks a lot for the heads up!

I checked megacli -AdpBbuCmd -a0 again and this is the status:

BBU Capacity Info for Adapter: 0

  Relative State of Charge: 82 %
  Absolute State of charge: 9 %
  Remaining Capacity: 156 mAh
  Full Charge Capacity: 191 mAh
  Run time to empty: Battery is not being charged.
  Average time to empty: Battery is not being charged.
  Estimated Time to full recharge: Battery is not being charged.

And: Date of Manufacture: 11/17, 2010
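The gap between the relative and absolute figures follows from the capacities reported above. A small worked check, assuming the controller uses the usual smart-battery convention (relative charge = remaining / full charge capacity, absolute charge = remaining / design capacity); the reported percentages are consistent with that reading:

```python
def pct(part, whole):
    """Percentage of `whole` represented by `part`, rounded to an integer."""
    return round(100 * part / whole)

DESIGN_MAH = 1700  # "Design Capacity" from the BBU Design Info section

# (remaining mAh, full charge capacity mAh) from the two status dumps
readings = [(18, 194), (156, 191)]

for remaining, full in readings:
    relative = pct(remaining, full)        # matches "Relative State of Charge"
    absolute = pct(remaining, DESIGN_MAH)  # matches "Absolute State of charge"
    health = pct(full, DESIGN_MAH)         # share of design capacity still usable
    print(relative, absolute, health)
# prints:
# 9 1 11
# 82 9 11
```

Under that assumption, the pack can only hold about 11 % of its design capacity, which would explain why even a nearly "full" battery reports a single-digit absolute charge.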

I am not an expert in these things, but my basic understanding is that write-intensive workloads (like EL) benefit from WriteBack with a working BBU, so WriteThrough seems suboptimal for this use case. Would it be worth replacing the BBU? In our case it should simply be a matter of turning off eventlogging on eventlog1001 (it reads from Kafka, so we can do it without losing data) and possibly eventlogging_sync.sh on the slaves too when Chris does the work.

Does it make sense?

elukey claimed this task. May 26 2017, 10:08 AM
elukey added a project: User-Elukey.

Hello,

If you are planning to keep that host for a long time (which I assume you are), I would definitely replace the BBU. I think @Cmjohnson might have spares from the hosts we have decommissioned lately; whether those BBUs are much newer, I don't know. But I would say change it, especially if it is not that hard to take that host offline for a few minutes.
Make sure you stop MySQL before shutting it down.

Everything you say is correct. We are decommissioning many <db1050 slaves, so there is a chance that Chris can get a better battery from an old server. Note that this server is scheduled for replacement (T156844), so whatever is done is more a non-technical decision (maintenance window, whether replacing it is worth it depending on the purchase time) than a technical one.

Yes let's replace the BBU, will wait for a confirmation from @Cmjohnson then!

elukey moved this task from Backlog to In Progress on the User-Elukey board. May 26 2017, 12:43 PM
Ottomata added a comment. Edited May 26 2017, 12:57 PM

How soon is T156844 likely to happen? Early next FY or later? If within Q1, I'd say let's just wait and replace the box. Otherwise, let's fix the BBU. Eh?

It is probably worth saying that the BBU might have been broken for a long time. We noticed because of the new check, but it would be too much of a coincidence if it had broken that very day. I am saying this because it might not affect performance a lot (otherwise we would probably have noticed before). That said, I do think it is worth replacing it if the host is going to be around for a long time still.

I agree with Manuel. While I would like to do the replacement ASAP, in reality it is not going to happen until Q2 or later.

for a long time still.

Agree, but how long? It is slated for replacement sometime next FY, right? Maybe we can just do it sooner rather than later?

jcrespo added a subscriber: Nuria. Edited May 26 2017, 1:02 PM

The reasoning is that labsdb has priority, and it is even in the best interest of Analytics to do that first, if I understood correctly. CC @Nuria

elukey added a comment. Edited May 26 2017, 1:06 PM

@Ottomata if Chris finds a BBU among the spare parts that we have, I'd say we can do it ASAP; it should be a relatively painless downtime for EL. If we need to buy a new one, let's maybe think about postponing until the new hw arrives, ok?

Another tip: once it is replaced (if it is), try to monitor its temperature once it boots up. In the last few weeks, during some server moves, we noticed some HW issues (especially on old hosts); this is one related to temperature on the BBU: T164107

This BBU failed again and the policy went back to WriteThrough:

Default Cache Policy: WriteBack, ReadAdaptive, Direct, No Write Cache if Bad BBU
Current Cache Policy: WriteThrough, ReadAdaptive, Direct, No Write Cache if Bad BBU

I have forced a relearn cycle again and it got back to normal state:

˜/icinga-wm 7:56> RECOVERY - MegaRAID on db1046 is OK: OK: optimal, 1 logical, 2 physical, WriteBack policy

root@db1046:~# megacli -AdpBbuCmd  -a0

BBU status for Adapter: 0

BatteryType: BBU
Voltage: 4066 mV
Current: 180 mA
Temperature: 37 C
Battery State: Optimal

Default Cache Policy: WriteBack, ReadAdaptive, Direct, No Write Cache if Bad BBU
Current Cache Policy: WriteBack, ReadAdaptive, Direct, No Write Cache if Bad BBU

Nuria added a comment. Jun 5 2017, 4:09 PM

Since this is the eventlogging master, can we move the refresh for this host to happen sooner? (ping @jcrespo) https://phabricator.wikimedia.org/T156844

Not really, we have almost decided the goals for Q1, and they are all quite urgent and for hardware that has already been bought. What we can do to accelerate this is to buy them now, or as fast as possible (so they arrive in Q1), and set them up at the beginning of Q2, as that was more or less the plan (this is not certain; it is my opinion of the best thing we could aim for).

elukey added a comment. Jun 6 2017, 7:00 AM

Thanks for the update Jaime, ordering the hw in Q1 would be good enough for us. At the moment the only concrete issue is the BBU that causes WT, but we are not seeing a huge performance impact.

My only concern is that this BBU issue is just what came up now, and thankfully it isn't impacting anything, but other issues might follow and could have more impact (e.g. the RAID controller dying).

elukey added a comment. Jun 6 2017, 8:14 AM

Sure, I am concerned too; this is why I asked if it was possible to order the hardware as soon as possible, so that we are ready to work on it by the end of Q1 :)

elukey added a comment. Jun 8 2017, 4:27 PM

@Cmjohnson sorry to ping :) Any idea if we have a spare BBU for db1046?

@elukey Yes, I have another decommissioned r510 to take it from. I will ping you in an hour or so to replace it.

elukey added a comment. Jun 8 2017, 4:41 PM

@Cmjohnson thanks! Would it be possible to do the swap next week? Since this is an important DB I'd need to coordinate my team and Jaime/Manuel first.

Please have a plan B just in case this host doesn't come back up. It is a very old server, and we know that sometimes old servers, once powered off, never come back, or they do, but with worse problems.
So I would encourage Analytics to come up with a plan B in case db1047 needs to take over as master: make sure everything is in place and the steps to make that host a master are well known, to avoid long downtimes :-)

elukey added a comment. Jun 9 2017, 7:09 AM

Thanks @Marostegui, I didn't think the situation was so desperate :D

If there is a risk of a bigger failure, I'd change my mind about the BBU and just go for setting WriteThrough while waiting for the new hardware. The failover plan needs to be addressed anyway, because it will take a while before db1046 is replaced.

I don't want to be pessimistic, but I have had issues with old servers in the past, so just wanted to give a heads up to make sure you guys have that in mind and a plan B :-)

@Marostegui we decided not to proceed with the BBU replacement; the risk is too high for little gain. We are ok for the moment to use WriteThrough, there shouldn't be any issue with EventLogging with this setting.

Can we (as Analytics) drive the hardware request for the new db104[67] hosts with your help (namely deciding the hw specs)? This should free some workload from you, while also making sure that we'll have new hw by the end of next quarter.

@Marostegui we decided not to proceed with the BBU replacement; the risk is too high for little gain. We are ok for the moment to use WriteThrough, there shouldn't be any issue with EventLogging with this setting.

Sure, however it is currently on WB because the BBU is reporting Optimal again.

Default Cache Policy: WriteBack, ReadAdaptive, Direct, No Write Cache if Bad BBU
Current Cache Policy: WriteBack, ReadAdaptive, Direct, No Write Cache if Bad BBU

It will go to WriteThrough once it fails again.

Can we (as Analytics) drive the hardware request for the new db104[67] hosts with your help (namely deciding the hw specs)? This should free some workload from you, while also making sure that we'll have new hw by the end of next quarter.

Sure - that would work!
Shall we close this ticket then?

elukey closed this task as Resolved. Jun 12 2017, 10:47 AM

@Cmjohnson sorry for the extra pings, we no longer need the BBU replacement. Thanks a lot anyway!

Mentioned in SAL (#wikimedia-operations) [2017-06-19T06:37:05Z] <jynus> force learning cycle to db1046 controller T166141

Marostegui added a comment. Edited Jun 19 2017, 9:23 AM

@elukey it looks like the BBU is now almost completely dead. Almost 3 hours after Jaime's relearn attempt, the battery status hasn't changed:

Charging Status              : Discharging
Voltage                                 : OK
Temperature                             : OK
Learn Cycle Requested	                  : Yes
Learn Cycle Active                      : Yes


Relative State of Charge: 8 %
Absolute State of charge: 1 %
Remaining Capacity: 16 mAh
Full Charge Capacity: 189 mAh
Run time to empty: 1 Min.
Average time to empty: 1 Min.
Estimated Time to full recharge: Battery is not being charged.

The policy is still on WriteThrough:

root@db1046:~# megacli -LDInfo -Lall -aALL | grep Policy
Default Cache Policy: WriteBack, ReadAdaptive, Direct, No Write Cache if Bad BBU
Current Cache Policy: WriteThrough, ReadAdaptive, Direct, No Write Cache if Bad BBU

I will leave it like that as it is not causing obvious performance issues, apart from higher latency for writes (expected if you are not running WB):
https://grafana.wikimedia.org/dashboard/db/mysql?panelId=32&fullscreen&orgId=1&var-dc=eqiad%20prometheus%2Fops&var-server=db1046&from=1497259285770&to=1497864085770

If we need to force WB, this would be the way:

megacli -LDSetProp -ForcedWB -Immediate -Lall -aAll

This happened again:

root@db1046:~# megacli -LDInfo -Lall -aALL | grep Policy
Default Cache Policy: WriteBack, ReadAdaptive, Direct, No Write Cache if Bad BBU
Current Cache Policy: WriteThrough, ReadAdaptive, Direct, No Write Cache if Bad BBU
BatteryType: BBU
Battery State: Unknown
  Battery backup charge time : 0 hours

I have forced a relearn, we will see...

Mentioned in SAL (#wikimedia-operations) [2017-07-05T05:08:39Z] <marostegui> Force a relearn on db1046's BBU - T166141

And it recovered for now:

˜/icinga-wm 7:15> RECOVERY - MegaRAID on db1046 is OK: OK: optimal, 1 logical, 2 physical, WriteBack policy

root@db1046:~# megacli -LDInfo -Lall -aALL | grep Policy
Default Cache Policy: WriteBack, ReadAdaptive, Direct, No Write Cache if Bad BBU
Current Cache Policy: WriteBack, ReadAdaptive, Direct, No Write Cache if Bad BBU

BatteryType: BBU
Voltage: 4066 mV
Current: 182 mA
Temperature: 37 C
Battery State: Optimal
BBU Firmware Status:

It alerted again, but this time it looks like the BBU is actually doing the learning:

BatteryType: BBU
Voltage: 3754 mV
Current: -674 mA
Temperature: 37 C
Battery State: Learning
BBU Firmware Status:

  Charging Status              : Discharging

I will leave it like that to see how it ends

˜/icinga-wm 11:05> RECOVERY - MegaRAID on db1046 is OK: OK: optimal, 1 logical, 2 physical, WriteBack policy

Mentioned in SAL (#wikimedia-operations) [2017-10-23T15:59:13Z] <elukey> forced BBU learn cycle on db1046 - T166141

Mentioned in SAL (#wikimedia-operations) [2017-11-10T06:39:54Z] <marostegui> Force a BBU relearn on db1046 - T166141

After the BBU re-learn:

˜/icinga-wm 7:50> RECOVERY - MegaRAID on db1046 is OK: OK: optimal, 1 logical, 2 physical, WriteBack policy