Decom db1048 (BBU Faulty - slave lagging)
Closed, Duplicate · Public

Description

Let's get rid of db1048 and set up a substitute for a phabricator slave

db1048 is lagging behind:

Seconds_Behind_Master: 7026

The cause looks like a faulty BBU, which has made the RAID controller switch its cache policy to WriteThrough

root@db1048:~# megacli -ldinfo -l0 -a0 | grep Policy
Default Cache Policy: WriteBack, ReadAheadNone, Direct, No Write Cache if Bad BBU
Current Cache Policy: WriteThrough, ReadAheadNone, Direct, No Write Cache if Bad BBU
root@db1048:~# megacli -AdpBbuCmd  -a0

BBU status for Adapter: 0

BatteryType: BBU
Voltage: 4025 mV
Current: 0 mA
Temperature: 32 C
Battery State: Degraded(Need Attention)
		A manual learn is required.
  Discharging             : Yes
  Relative State of Charge: 31 %
  Absolute State of charge: 3 %
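
The diagnosis above comes down to comparing the first field of the two policy lines. A minimal sketch of that comparison as a shell function that reads `megacli -ldinfo` output on stdin. The function name `check_cache_policy` and the Nagios-style exit codes are assumptions for illustration, and the parsing assumes the line layout shown in this task:

```shell
# Hypothetical helper: reads `megacli -ldinfo -l0 -a0` output on stdin and
# reports whether the current write policy differs from the default one.
# Assumes the "Default/Current Cache Policy: <mode>, ..." layout shown above.
check_cache_policy() {
  awk -F': *' '
    /^Default Cache Policy:/ { split($2, d, ","); def = d[1] }
    /^Current Cache Policy:/ { split($2, c, ","); cur = c[1] }
    END {
      if (def == "" || cur == "") { print "UNKNOWN: policy lines not found"; exit 3 }
      if (def != cur) { print "WARNING: current=" cur " default=" def; exit 1 }
      print "OK: " cur
    }'
}
```

On the host this would be used as `megacli -ldinfo -l0 -a0 | check_cache_policy`.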

Event Timeline

Restricted Application added a subscriber: Aklapper. Mar 17 2017, 7:29 AM
Marostegui renamed this task from db1048 BBU broken - slave lagging to db1048 BBU Faulty - slave lagging. Mar 17 2017, 7:36 AM

I have manually forced a BBU learn cycle and it is now looking fine:

root@db1048:~#  megacli -AdpBbuCmd -BbuLearn -aALL -NoLog

Adapter 0: BBU Learn Succeeded.
root@db1048:~# megacli -AdpBbuCmd  -a0

BBU status for Adapter: 0

BatteryType: BBU
Voltage: 4055 mV
Current: 166 mA
Temperature: 32 C
Battery State: Optimal
Relative State of Charge: 34 %
Charger Status: In Progress
root@db1048:~# megacli -ldinfo -l0 -a0 | grep Policy
Default Cache Policy: WriteBack, ReadAheadNone, Direct, No Write Cache if Bad BBU
Current Cache Policy: WriteBack, ReadAheadNone, Direct, No Write Cache if Bad BBU

The slave is catching up without problems now that the policy has changed back to WB:

Seconds_Behind_Master: 3756

And I have disabled the auto-learn cycle just in case:

root@db1048:~# echo "autoLearnMode=1" > disable_learn
root@db1048:~# megacli -AdpBbuCmd -a0 | grep Auto-Learn
  Auto-Learn Mode: Warn via Event
root@db1048:~# megacli -AdpBbuCmd -SetBbuProperties -f disable_learn -a0

Adapter 0: Set BBU Properties Succeeded.

Exit Code: 0x00
root@db1048:~#  megacli -AdpBbuCmd -a0 | grep Auto-Learn
  Auto-Learn Mode: Disabled
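
The before/after check above can be wrapped so that it is easy to repeat after a battery swap. This is only a sketch: the function names are hypothetical, and the only mapping taken from this task is that `autoLearnMode=1` results in `Auto-Learn Mode: Disabled`.

```shell
# Hypothetical helper: prints the Auto-Learn Mode reported by
# `megacli -AdpBbuCmd -a0` (read from stdin), e.g. "Disabled".
auto_learn_mode() {
  awk -F': *' '/Auto-Learn Mode:/ { print $2; exit }'
}

# Succeeds only when the mode read from stdin is exactly "Disabled".
auto_learn_disabled() {
  [ "$(auto_learn_mode)" = "Disabled" ]
}
```

For example: `megacli -AdpBbuCmd -a0 | auto_learn_disabled || echo "auto-learn is not disabled"`.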

I will close this ticket for now, but at least we have it for the future, if we see this failing again on this host.

Marostegui closed this task as Resolved. Mar 17 2017, 7:42 AM
Marostegui claimed this task.

Do you think we should force a learning cycle on db1047 (T159266)?

I just tried - we will see!

But db1047 has a different (and more worrying) error for BBU a1:

Battery State: Failed

db1047's BBU is acting very weirdly. Its state goes from Failed -> Charging -> Failed, and its charge report has gone from:

Relative State of Charge: 4 %
Charger Status: Unknown

To:

Relative State of Charge: 100 %
Charger Status: In Progress

But still marking it as Failed.
I will leave it like that and check it in a few hours.

Unfortunately, db1047's BBU looks totally broken; what it reports does not make sense. In some places it says it is fully charged, in others it does not, and it is always reporting FAILED.
It was worth trying but we will need to stick to T159266

Marostegui reopened this task as Open. Mar 30 2017, 6:04 AM
Marostegui added a subscriber: Cmjohnson.

This has happened again, so maybe the BBU is indeed faulty.

root@db1048:~# date ; mysql --skip-ssl -e "show slave status\G" | grep Seconds
Thu Mar 30 05:59:21 UTC 2017
        Seconds_Behind_Master: 14268
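
A sketch of how the lag value above could be extracted and classified automatically. The function names are hypothetical, and the 60s/300s thresholds are illustrative assumptions, not the ones used by the real Icinga check:

```shell
# Hypothetical helper: reads `show slave status\G` output on stdin and prints
# the Seconds_Behind_Master value (MySQL prints NULL when it is unknown).
seconds_behind_master() {
  awk -F': *' '/Seconds_Behind_Master:/ { print $2; exit }'
}

# Nagios-style classification of a lag value in seconds.
# The default thresholds (warn=60, crit=300) are assumptions.
classify_lag() {
  local lag="$1" warn="${2:-60}" crit="${3:-300}"
  case $lag in
    ''|*[!0-9]*) echo "UNKNOWN: lag is '$lag'"; return 3 ;;
  esac
  if [ "$lag" -ge "$crit" ]; then echo "CRITICAL: ${lag}s behind"; return 2; fi
  if [ "$lag" -ge "$warn" ]; then echo "WARNING: ${lag}s behind"; return 1; fi
  echo "OK: ${lag}s behind"
}
```

With the output above, `classify_lag "$(mysql --skip-ssl -e "show slave status\G" | seconds_behind_master)"` would report the 14268 seconds as critical under these assumed thresholds.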

Policy back to WriteThrough:

root@db1048:~#  megacli -ldinfo -l0 -a0 | grep Policy
Default Cache Policy: WriteBack, ReadAheadNone, Direct, No Write Cache if Bad BBU
Current Cache Policy: WriteThrough, ReadAheadNone, Direct, No Write Cache if Bad BBU
root@db1048:~#  megacli -AdpBbuCmd  -a0

BBU status for Adapter: 0

BatteryType: BBU
Battery State: Unknown
  Battery backup charge time : 0 hours

BBU Capacity Info for Adapter: 0

  Relative State of Charge: 23 %
  Absolute State of charge: 2 %
BBU Properties for Adapter: 0

  Auto Learn Period: 90 Days
  Next Learn time: None  Learn Delay Interval:0 Hours
  Auto-Learn Mode: Disabled

Forcing a relearn cycle, which is what we did last time, makes it better for a bit:

root@db1048:~# megacli -AdpBbuCmd -BbuLearn -aALL -NoLog

Adapter 0: BBU Learn Succeeded.
root@db1048:~#  megacli -AdpBbuCmd  -a0

BBU status for Adapter: 0

BatteryType: BBU
Voltage: 4021 mV
Current: 0 mA
Temperature: 34 C
Battery State: Degraded(Need Attention)
		A manual learn is required.
  Learn Cycle Requested	                  : Yes

Note it now says degraded instead of unknown and it is now charging:

BBU Capacity Info for Adapter: 0

  Relative State of Charge: 26 %
  Absolute State of charge: 3 %
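
To follow the Unknown -> Degraded -> Optimal progression without reading the full dump each time, the two interesting fields can be pulled out of the output. A minimal sketch, assuming the field names shown in this task; the function name `bbu_summary` is hypothetical, and treating only `Optimal` as healthy is an assumption for illustration:

```shell
# Hypothetical helper: reads `megacli -AdpBbuCmd -a0` output on stdin and
# prints the battery state plus the first relative charge value.
# Exits 0 only when the state is "Optimal" (an assumption for illustration).
bbu_summary() {
  awk -F': *' '
    /^[ \t]*Battery State:/                         { state = $2 }
    /^[ \t]*Relative State of Charge:/ && rel == "" { rel = $2 + 0 }
    END {
      printf "state=%s relative_charge=%s\n",
             (state == "" ? "missing" : state), (rel == "" ? "missing" : rel)
      exit (state == "Optimal" ? 0 : 1)
    }'
}
```

Running `megacli -AdpBbuCmd -a0 | bbu_summary` periodically gives a one-line view of whether the battery is recovering.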

I believe this is broken anyways.
@Cmjohnson do we have spare BBUs from some other hosts that we have decommissioned?
I am going to keep an eye on this host to see whether, once the learn cycle has finished, the policy goes back to WB and the lag recovers

˜/icinga-wm 8:27> RECOVERY - MariaDB Slave Lag: m3 on db1048 is OK: OK slave_sql_lag Replication lag: 0.00 seconds
Battery State: Optimal
root@db1048:~#  megacli -ldinfo -l0 -a0 | grep Policy
Default Cache Policy: WriteBack, ReadAheadNone, Direct, No Write Cache if Bad BBU
Current Cache Policy: WriteBack, ReadAheadNone, Direct, No Write Cache if Bad BBU

I believe there was maintenance or trouble on Phabricator yesterday. I would ask RelEng first.

Yep, the deployment page said there was a phabricator update, so maybe that put more stress on the server and made the BBU fail (again)?
The policy has already gone back to "safe" mode twice...

Paladox edited subscribers, added: mmodell; removed: 20after4. Mar 30 2017, 2:29 PM
Paladox added a subscriber: Paladox.
Marostegui moved this task from Triage to In progress on the DBA board. Mar 31 2017, 3:52 PM

This has happened again:

˜/icinga-wm 17:47> PROBLEM - MariaDB Slave Lag: m3 on db1048 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 328.00 seconds

root@db1048:~#  megacli -ldinfo -l0 -a0 | grep Policy
Default Cache Policy: WriteBack, ReadAheadNone, Direct, No Write Cache if Bad BBU
Current Cache Policy: WriteThrough, ReadAheadNone, Direct, No Write Cache if Bad BBU
root@db1048:~# megacli -AdpBbuCmd  -a0

BBU status for Adapter: 0

BatteryType: BBU
Battery State: Unknown
  Battery backup charge time : 0 hours

BBU Capacity Info for Adapter: 0

  Relative State of Charge: 33 %
  Absolute State of charge: 3 %

Forced a relearn and the BBU updated its state:

root@db1048:~#  megacli -AdpBbuCmd -BbuLearn -aALL -NoLog

Adapter 0: BBU Learn Succeeded.

Exit Code: 0x00
root@db1048:~# megacli -AdpBbuCmd  -a0

BBU status for Adapter: 0

BatteryType: BBU
Voltage: 4029 mV
Current: 0 mA
Temperature: 33 C
Battery State: Degraded(Need Attention)

And it is now charging slowly compared to previous values:

root@db1048:~# megacli -AdpBbuCmd  -a0 | egrep "Relative|Absolute"
Relative State of Charge: 36 %
  Relative State of Charge: 36 %
  Absolute State of charge: 4 %

And it recovered:

root@db1048:~# megacli -AdpBbuCmd  -a0

BBU status for Adapter: 0

BatteryType: BBU
Voltage: 4058 mV
Current: 152 mA
Temperature: 33 C
Battery State: Optimal
root@db1048:~#  megacli -ldinfo -l0 -a0 | grep Policy
Default Cache Policy: WriteBack, ReadAheadNone, Direct, No Write Cache if Bad BBU
Current Cache Policy: WriteBack, ReadAheadNone, Direct, No Write Cache if Bad BBU

Is there anything I need to be doing for this?

Do you have any spare BBU available?

@Marostegui yes, I can use one from a decommissioned server.

@Cmjohnson thanks. Let me coordinate this and we will arrange one day to do the swap.
@mmodell is there any problem if we take db1048 down for a few minutes to replace its faulty BBU? ( it is the phabricator slave, but as far as I know phabricator doesn't do master-slave query separation)

There are some reports running on the slave. We should point the slave to the master through the DNS alias to avoid activity there.

@Marostegui: correct, phabricator isn't currently querying the slave, other than the reports mentioned by @jcrespo.

Great, we can change the DNS and that should be it! Thanks!

This just happened again:

root@db1048:~# megacli -AdpBbuCmd  -a0

BBU status for Adapter: 0

BatteryType: BBU
Battery State: Unknown
  Battery backup charge time : 0 hours

BBU Capacity Info for Adapter: 0

  Relative State of Charge: 26 %
  Absolute State of charge: 3 %
  Remaining Capacity: 43 mAh
  Full Charge Capacity: 167 mAh
  Run time to empty: Battery is not being charged.
  Average time to empty: Battery is not being charged.
  Estimated Time to full recharge: Battery is not being charged.
root@db1048:~#  megacli -ldinfo -l0 -a0 | grep Policy
Default Cache Policy: WriteBack, ReadAheadNone, Direct, No Write Cache if Bad BBU
Current Cache Policy: WriteThrough, ReadAheadNone, Direct, No Write Cache if Bad BBU

I have forced the relearn cycle.

I am going to get the DNS changed so we can schedule a day to replace the BBU

Change 352769 had a related patch set uploaded (by Marostegui; owner: Marostegui):
[operations/dns@master] wmnet: Point m3 slave to codfw master

https://gerrit.wikimedia.org/r/352769

And after the manual relearn it is back to normal state:

root@db1048:~#  megacli -ldinfo -l0 -a0 | grep Policy
Default Cache Policy: WriteBack, ReadAheadNone, Direct, No Write Cache if Bad BBU
Current Cache Policy: WriteBack, ReadAheadNone, Direct, No Write Cache if Bad BBU

Change 352769 merged by Marostegui:
[operations/dns@master] wmnet: Point m3 slave to codfw master

https://gerrit.wikimedia.org/r/352769

@Marostegui Let me know if you want to do the BBU swap today.

Mentioned in SAL (#wikimedia-operations) [2017-05-09T14:09:33Z] <marostegui> Stop MySQL and shutdown db1048 (phabricator slave) to replace BBU - T160731

Marostegui closed this task as Resolved. May 9 2017, 2:35 PM

@Cmjohnson has changed the battery, we will see how it goes.

root@db1048:~# megacli -AdpBbuCmd  -a0

BBU status for Adapter: 0

BatteryType: BBU
Voltage: 4075 mV
Current: 0 mA
Temperature: 32 C
Battery State: Optimal

Fully Charged           : Yes
  Discharging             : Yes
  Initialized             : Yes

Charging Terminated     : Yes
  Over Charged            : No
Relative State of Charge: 100 %
Charger Status: Complete
Remaining Capacity: 248 mAh
Full Charge Capacity: 248 mAh
isSOHGood: Yes

I have turned off the AutoLearn again:

root@db1048:~# megacli -AdpBbuCmd -a0 | grep Auto-Learn
  Auto-Learn Mode: Enabled
root@db1048:~# echo "autoLearnMode=1" > disable_learn
root@db1048:~# megacli -AdpBbuCmd -SetBbuProperties -f disable_learn -a0

Adapter 0: Set BBU Properties Succeeded.

Exit Code: 0x00
root@db1048:~# megacli -AdpBbuCmd -a0 | grep Auto-Learn
  Auto-Learn Mode: Disabled

I am going to close this as resolved for now, and if it happens again, I will reopen it.

I am going to leave m3-slave pointing to the codfw master until tomorrow, just in case. If the host is fine overnight, I will revert the change.

Volans reopened this task as Open. May 27 2017, 10:17 AM
Volans added a subscriber: Volans.

Re-opening as it alarmed again today for the write policy... The battery is reported to be from 2010; was it not swapped a few days ago?

$ sudo megacli -AdpBbuCmd -GetBbuStatus -aALL

BBU status for Adapter: 0

BatteryType: BBU
Battery State: Unknown
$ sudo megacli -AdpBbuCmd -GetBbuCapacityInfo -aAll


BBU Capacity Info for Adapter: 0

  Relative State of Charge: 23 %
  Absolute State of charge: 3 %
  Remaining Capacity: 56 mAh
  Full Charge Capacity: 248 mAh
  Run time to empty: Battery is not being charged.
  Average time to empty: Battery is not being charged.
  Estimated Time to full recharge: Battery is not being charged.
  Cycle Count: 36
Max Error = 0 %
Remaining Capacity Alarm = 170 mAh
Remining Time Alarm = 10 Min
$ sudo megacli -AdpBbuCmd  -a0

BBU status for Adapter: 0

BatteryType: BBU
Battery State: Unknown
  Battery backup charge time : 0 hours

BBU Capacity Info for Adapter: 0

  Relative State of Charge: 23 %
  Absolute State of charge: 3 %
  Remaining Capacity: 56 mAh
  Full Charge Capacity: 248 mAh
  Run time to empty: Battery is not being charged.
  Average time to empty: Battery is not being charged.
  Estimated Time to full recharge: Battery is not being charged.
  Cycle Count: 36
Max Error = 0 %
Remaining Capacity Alarm = 170 mAh
Remining Time Alarm = 10 Min

BBU Design Info for Adapter: 0

  Date of Manufacture: 11/17, 2010
  Design Capacity: 1700 mAh
  Design Voltage: 3700 mV
  Specification Info: 33
  Serial Number: 5491
  Pack Stat Configuration: 0x0000
  Manufacture Name: SANYO
  Firmware Version   :
  Device Name: DLNU209
  Device Chemistry: LION
  Battery FRU: N/A
  Transparent Learn = 0
  App Data = 0

BBU Properties for Adapter: 0

  Auto Learn Period: 90 Days
  Next Learn time: None  Learn Delay Interval:0 Hours
  Auto-Learn Mode: Disabled
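
The percentages in this output are consistent with relative charge being remaining capacity over full charge capacity, and absolute charge being remaining capacity over design capacity. That is an inference from the numbers in this task, not something taken from the MegaCLI documentation. A small sketch to redo the arithmetic (the helper name `pct` is hypothetical):

```shell
# Rounded percentage of $1 over $2 (both in mAh).
pct() {
  awk -v n="$1" -v d="$2" 'BEGIN { printf "%.0f\n", 100 * n / d }'
}

pct 56 248     # remaining / full charge capacity  -> matches "Relative State of Charge: 23 %"
pct 56 1700    # remaining / design capacity       -> matches "Absolute State of charge: 3 %"
pct 248 1700   # full charge / design capacity     -> share of design capacity still usable
```

By the same arithmetic, the replacement battery can hold only about 15% of its 1700 mAh design capacity, which fits a pack manufactured in 2010.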
Volans added a comment (edited). May 27 2017, 10:23 AM

So far the lag is limited to 3~4 seconds according to tendril, while Grafana shows a flat zero. Maybe the dashboard is not graphing the right data?
See db1048 replication lag dashboard.

It was swapped a few weeks ago, but I guess the new one is also pretty old as it comes from hosts previously decommissioned - right @Cmjohnson ?

And db1048 returned to WriteBack policy less than 1h ago 😛

Same behaviour as we have seen before with faulty BBUs :-(

@Volans and @Marostegui I can do this as soon as you give me the go-ahead, but keep in mind this is only going to be temporary. The BBUs for this class and age of server are failing in record numbers. Please think about a replacement server sooner rather than later.

Thanks Chris, I will have this ready for tomorrow, so we can do it then if that works for you.
We are aware that this will happen again, we are trying to get rid of most of the old hosts (those <db1050 - T134476). But it will take some time.

Great! Ping me when I can do the swap.

Change 356334 had a related patch set uploaded (by Marostegui; owner: Marostegui):
[operations/dns@master] wmnet: Point m3 slave to codfw master

https://gerrit.wikimedia.org/r/356334

Change 356334 merged by Marostegui:
[operations/dns@master] wmnet: Point m3 slave to eqiad master

https://gerrit.wikimedia.org/r/356334

Mentioned in SAL (#wikimedia-operations) [2017-05-31T13:08:09Z] <marostegui> Stop MySQL on db1048 and shutdown the host for maintenance - T160731

@Cmjohnson db1048 is now down and ready for you to swap the BBU

Thanks!

Replaced the battery with a well-used one from a decom'd db. Hopefully this will work for long enough. The server has been powered on.

Thanks Chris.
The battery is now charging

Battery State: Optimal
BBU Firmware Status:

  Charging Status              : Charging
  Relative State of Charge: 29 %
  Absolute State of charge: 9 %
  Remaining Capacity: 156 mAh
  Full Charge Capacity: 535 mAh
  Run time to empty: Battery is not being charged.
  Average time to empty: Battery is not being charged.
  Estimated Time to full recharge: 1 Hour, 43 Min.
Marostegui closed this task as Resolved. May 31 2017, 3:40 PM

I will mark this as resolved again and let's see how long it lasts

root@db1048:~#  megacli -ldinfo -l0 -a0 | grep Policy
Default Cache Policy: WriteBack, ReadAheadNone, Direct, No Write Cache if Bad BBU
Current Cache Policy: WriteBack, ReadAheadNone, Direct, No Write Cache if Bad BBU

I will revert the DNS patch tomorrow morning once the battery has recharged and all that.

The battery looks good now, it recharged, the temperature is ok and I have disabled the auto learn.
I have started MySQL and once it catches up I will merge the DNS revert
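
Waiting for the slave to catch up before reverting DNS can be scripted as a polling loop. This is only a sketch under stated assumptions: `wait_for_catchup` is a hypothetical name, the lag source is passed in as a command so the loop does not depend on any particular client invocation, and there is no overall timeout, so it should not be left unattended.

```shell
# Hypothetical helper: wait_for_catchup MAX_LAG INTERVAL CMD [ARGS...]
# Runs CMD repeatedly; CMD must print the current lag in whole seconds.
# Returns 0 once lag <= MAX_LAG, or 1 if CMD prints something non-numeric
# (for example NULL when replication is stopped).
wait_for_catchup() {
  local max_lag="$1" interval="$2" lag
  shift 2
  while :; do
    lag=$("$@")
    case $lag in
      ''|*[!0-9]*)
        echo "lag is not a number: '$lag'" >&2
        return 1 ;;
    esac
    [ "$lag" -le "$max_lag" ] && return 0
    sleep "$interval"
  done
}

# Example (assumed wiring, not run here):
#   current_lag() { mysql --skip-ssl -e "show slave status\G" |
#                   awk -F': *' '/Seconds_Behind_Master:/ { print $2; exit }'; }
#   wait_for_catchup 0 30 current_lag && echo "caught up, safe to revert DNS"
```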

jcrespo reopened this task as Open. Aug 30 2017, 10:22 AM

I think this happened again yesterday. I will just turn this task into a decommissioning one.

jcrespo renamed this task from db1048 BBU Faulty - slave lagging to Decom db1048 (BBU Faulty - slave lagging). Aug 30 2017, 10:23 AM
jcrespo removed projects: Patch-For-Review, ops-eqiad.
jcrespo updated the task description.
jcrespo removed subscribers: Volans, Stashbot, gerritbot and 4 others.
Cmjohnson triaged this task as Low priority. Sep 5 2017, 7:30 PM

@Cmjohnson We are going to decom db1048 (but we are not ready yet), please do not take any action here, we will just clone it and ask you to unrack it. Opened for DBA tracking purposes only.

jcrespo moved this task from Blocked external/Not db team to Next on the DBA board. Sep 5 2017, 7:35 PM

No worries, I was just moving it to a lower priority for me. I am a couple of weeks away from tackling decoms.

Not yet, this is still in use.