
db1047 BBU RAID issues (was: Investigate db1047 replication lag)
Closed, Resolved, Public

Description

It can be TokuDB, pt-table-checksum, indexes, BBU, disks, ...

I have acked it for now.

It is probably the second BBU.

Event Timeline

It caught up before I was able to check it. I saw it 13 minutes behind, but
by the time I got to a terminal it was already up to date.
Did you see which thread was lagging? S1 or S2? I will investigate further
tomorrow.
I ran pt-table-checksum today from 8AM to 6PM on several s2 wikis, but
there was no lag at all, so I don't know whether it could have caused lag later, once it
got there. If it had, I guess the host would still be lagging, as the checksums ran for
several hours during the day.

I also checked the slow queries report for around those times (just a quick
glance) and everything looked related to INSERTs.

Quick look at the BBU reveals no issue there.
More to come tomorrow, server is fine now.

So, by looking at the binlogs I have seen that all activity related to pt-table-checksum finished at this time:

db1047-bin.005028 #170228 17:13:33

As I said yesterday, the BBU looked good. Although I saw this earlier:

Relative State of Charge: 97 %

And this is the number today:

Relative State of Charge: 95 %
Discharging             : Yes

However:

Battery State: Optimal

The policy, though, looks fine (this was the same yesterday too):

root@db1047:/srv/sqldata# megacli -LDInfo -LAll -aAll | grep "Cache Policy:"
Default Cache Policy: WriteBack, ReadAheadNone, Direct, No Write Cache if Bad BBU
Current Cache Policy: WriteBack, ReadAheadNone, Direct, No Write Cache if Bad BBU
Default Cache Policy: WriteBack, ReadAdaptive, Direct, Write Cache OK if Bad BBU
Current Cache Policy: WriteBack, ReadAdaptive, Direct, Write Cache OK if Bad BBU
Default Cache Policy: WriteBack, ReadAdaptive, Direct, Write Cache OK if Bad BBU
Current Cache Policy: WriteBack, ReadAdaptive, Direct, Write Cache OK if Bad BBU
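
The output above mixes two behaviours: one logical drive gives up its write cache when the BBU is bad, the others keep it. A minimal sketch for labelling each entry, assuming only the "Current Cache Policy:" line format shown above (classify_cache_policy is a hypothetical helper, not something megacli provides):

```shell
# Sketch only: reads `megacli -LDInfo -LAll -aAll` output on stdin.
# classify_cache_policy is a hypothetical helper, not part of megacli.
# Entries are numbered in output order, which is not necessarily the LD id.
classify_cache_policy() {
    awk '
        /Current Cache Policy:/ {
            n++
            if ($0 ~ /Write Cache OK if Bad BBU/)
                verdict = "write-back kept even with a bad BBU"
            else if ($0 ~ /No Write Cache if Bad BBU/)
                verdict = "drops to write-through on a bad BBU"
            else
                verdict = "unrecognised policy"
            print "entry " n ": " verdict
        }
    '
}

# Usage (on the host): megacli -LDInfo -LAll -aAll | classify_cache_policy
```

An entry that drops to write-through is the one whose write performance depends on the battery being healthy.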

And the RAID looks good:

root@db1047:/srv/sqldata# megacli -LDInfo -LAll -aAll | grep "State"
State               : Optimal
State               : Optimal
State               : Optimal

At around the time of the delay, there is a huge spike in inserts, as can be seen in the image:

[Image: screenshot-grafana-admin.wikimedia.org-2017-03-01-08-27-25.png]

Here we can see there are also slow queries running at around that time: https://tendril.wikimedia.org/report/slow_queries?host=%5Edb1047&user=&schema=&qmode=eq&query=&hours=24

I have been doing more pt-table-checksum runs this morning and I have stopped now as I am going for lunch. There are no pending transactions on db1047, so I would rule out pt-table-checksum for now.

jcrespo claimed this task.

Let's close it; I only opened it as a reminder in case it continued the following day.

BBU status for Adapter: 1

BatteryType: BBU
Voltage: 3788 mV
Current: 0 mA
Temperature: 39 C
Battery State: Failed
BBU Firmware Status:

  Charging Status              : None
  Voltage                                 : OK
  Temperature                             : OK
  Learn Cycle Requested                   : No
  Learn Cycle Active                      : No
  Learn Cycle Status                      : OK
  Learn Cycle Timeout                     : No
  I2c Errors Detected                     : No
  Battery Pack Missing                    : No
  Battery Replacement required            : Yes
  Remaining Capacity Low                  : Yes
  Periodic Learn Required                 : No
  Transparent Learn                       : No
  No space to cache offload               : No
  Pack is about to fail & should be replaced : No
  Cache Offload premium feature required  : No
  Module microcode update required        : No


GasGuageStatus:
  Fully Discharged        : No
  Fully Charged           : No
  Discharging             : Yes
  Initialized             : Yes
  Remaining Time Alarm    : No
  Discharge Terminated    : No
  Over Temperature        : No
  Charging Terminated     : No
  Over Charged            : No
Relative State of Charge: 4 %
Charger Status: Unknown
Remaining Capacity: 18 mAh
Full Charge Capacity: 494 mAh
isSOHGood: No
  Battery backup charge time : 0 hours

BBU Capacity Info for Adapter: 1

  Relative State of Charge: 4 %
  Absolute State of charge: 1 %
  Remaining Capacity: 18 mAh
  Full Charge Capacity: 494 mAh
  Run time to empty: Battery is not being charged.  
  Average time to empty: Battery is not being charged.  
  Estimated Time to full recharge: Battery is not being charged.  
  Cycle Count: 19
Max Error = 0 %
Remaining Capacity Alarm = 170 mAh
Remining Time Alarm = 10 Min

BBU Design Info for Adapter: 1

  Date of Manufacture: 12/04, 2011
  Design Capacity: 1700 mAh
  Design Voltage: 3700 mV
  Specification Info: 33
  Serial Number: 989
  Pack Stat Configuration: 0x0004
  Manufacture Name: SANYO
  Firmware Version   : 
  Device Name: DLGC9R0
  Device Chemistry: LION
  Battery FRU: N/A
  Battery FRU: N/A
  Transparent Learn = 0
  App Data = 0

BBU Properties for Adapter: 1

  Auto Learn Period: 90 Days
  Next Learn time: Sat Apr 29 00:18:08 2017
  Learn Delay Interval:0 Hours
  Auto-Learn Mode: Enabled

Exit Code: 0x00
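
A dump this long is easy to misread, so here is a minimal sketch that condenses it to one line. It assumes only the field names visible in the output above; bbu_summary and its "needs-attention" rule are hypothetical, not a megacli feature:

```shell
# Sketch only: reads `megacli -AdpBbuCmd -aN` output on stdin and prints a
# one-line summary. bbu_summary is a hypothetical helper; the verdict rule
# (state not Optimal, replacement required, or isSOHGood: No) is an assumption.
bbu_summary() {
    awk -F': *' '
        function trim(s) { gsub(/^[ \t]+|[ \t\r]+$/, "", s); return s }
        /^Battery State:/              { state = trim($2) }
        /Battery Replacement required/ { repl = trim($2) }
        /^Relative State of Charge:/   { rsoc = trim($2); gsub(/ /, "", rsoc) }
        /^isSOHGood:/                  { soh = trim($2) }
        END {
            bad = (state != "Optimal" || repl == "Yes" || soh == "No")
            printf "state=%s replacement_required=%s relative_charge=%s soh_good=%s verdict=%s\n",
                   state, repl, rsoc, soh, (bad ? "needs-attention" : "ok")
        }
    '
}

# Usage (on the host): megacli -AdpBbuCmd -a1 | bbu_summary
```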
jcrespo added a subscriber: Cmjohnson.

We need to change the battery of the second controller (number 1) and disable auto-learning there (it was only disabled on number 0).

For the first part we need @Cmjohnson, next week.

jcrespo renamed this task from Investigate db1047 replication lag to db1047 BBU RAID issues (was: Investigate db1047 replication lag). Mar 2 2017, 12:48 PM
jcrespo removed jcrespo as the assignee of this task.
jcrespo triaged this task as Medium priority.
jcrespo updated the task description.

I have disabled the auto-learn mode for that controller - I have not set it to "2" (warn via an event) because we are not really using it:

root@db1047:~# echo "autoLearnMode=1" > disable_learn
root@db1047:~# megacli -AdpBbuCmd -SetBbuProperties -f disable_learn -a1

Adapter 1: Set BBU Properties Succeeded.

Exit Code: 0x00
root@db1047:~# megacli -AdpBbuCmd -a1 | grep Auto-Learn
  Auto-Learn Mode: Disabled
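
Since auto-learn had only been disabled on one of the two controllers, a check across all adapters may help catch this next time. A minimal sketch, assuming that the -aALL output repeats the "BBU Properties for Adapter: N" header once per adapter, as the -a1 output does (adapters_with_autolearn is a hypothetical helper):

```shell
# Sketch only: reads `megacli -AdpBbuCmd -aALL` output on stdin and prints the
# number of every adapter whose Auto-Learn Mode is still Enabled.
# Assumes each adapter section starts with "BBU Properties for Adapter: N".
adapters_with_autolearn() {
    awk '
        /^BBU Properties for Adapter:/         { adapter = $NF }
        /Auto-Learn Mode:/ && $NF == "Enabled" { print adapter }
    '
}

# Usage (on the host): megacli -AdpBbuCmd -aALL | adapters_with_autolearn
```

No output means auto-learn is disabled everywhere.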

@Cmjohnson once you are back in the DC can you check if you have any spare BBU?
Thanks!

Marostegui changed the task status from Open to Stalled. Mar 17 2017, 2:21 PM

Let's block this as db1047 might be decommissioned soon as per: T156844

@Marostegui There are a few decommissioned DBs now; I could swap out the BBU if you like, or we can just proceed with the decom process. Let me know your preference.

Hey @Cmjohnson! Let's wait and see if that ticket keeps progressing for now; if the server is going to get decommissioned, it would just be a waste of time to swap its BBU now.
I will keep you posted!
Thanks!

@Marostegui precisely on that ticket they are discussing when they will be able to decom it, and by the looks of it, it is not going to happen for months.

As much as I would love to take away this machine, I would be realistic (or pessimistic! O:-) ) and change the BBU (of course, with no urgency). I thought the batteries were soldered onto the board or something, but based on how little time it took Chris to change one, it must not be that complex (I would let Chris confirm that).

The alternative is getting an alert during the weekend every week.

I had the impression that for db1047 it was going to be pretty fast, judging from the comments. Maybe @Ottomata can give us some more accurate info?
@Ottomata do you think db1047 is going to be gone soonish?
I am not talking about db1046 or dbstore1002, just db1047.

Thanks!

Given the responses so far, I think we will be able to decom it soon. But, we should wait a while (maybe a week) to collect more feedback to be sure. In either case, we are going to have to refresh that hardware soon, right? If we have no objections within a week, I think we can just decom it asap.

Let's wait then to see how that ticket progresses over the next week or so, so that we don't make Chris replace it and then decom the server a few days later :)
Thanks for the fast answer!

@Marostegui Same thing as db1048? I can use a spare BBU from a decom server if you like, or is this server nearing its last days?

@Cmjohnson it was supposed to be killed soon, but @Ottomata believes it will take a bit longer, so maybe it is worth replacing the BBU.
@Ottomata how difficult would it be to arrange a day for this host to be down for a few minutes so Chris can replace its faulty BBU?

Not difficult at all. I think this server is not used often, only really when there are issues with dbstore1002. @Cmjohnson, let me know what day is good for you and I'll send an email out.

@Ottomata Let's schedule for Wednesday next week @10am EST.

@Cmjohnson Cool, I've announced the date. Let's do it.

This Wednesday is the failover; do you really want to do it then? We may need Chris or me to take things down, and we may be unavailable.

OO interesting. Yeah @Cmjohnson that IS a bad time. When else do you wanna?

@Ottomata: is there a better time this week or do you push it out to next week?

As long as we give a couple of days heads up, I think we're fine. Pick a day any day :) Just let me know with enough time to get an email out.

@Ottomata remember to run "stop all slaves;" before shutting down MySQL (not a hard requirement, but in case there is a transaction hanging, better to be careful than to need to reclone that host!)
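
A minimal sketch of that precaution: confirm every replication connection has really stopped before shutting MySQL down. It assumes MariaDB multi-source replication and the usual field names in the vertical output of SHOW ALL SLAVES STATUS; replication_stopped is a hypothetical helper, and the shutdown command in the usage comment is an assumption about how the host is stopped:

```shell
# Sketch only: reads `SHOW ALL SLAVES STATUS\G` output on stdin.
# Prints any connection whose IO or SQL thread is not "No" and returns
# non-zero in that case; returns 0 when everything is stopped.
replication_stopped() {
    awk '
        $1 == "Connection_name:" { conn = $2 }
        ($1 == "Slave_IO_Running:" || $1 == "Slave_SQL_Running:") && $2 != "No" {
            print "still running: " conn " " $1 " " $2
            running = 1
        }
        END { exit (running ? 1 : 0) }
    '
}

# Usage (on the host, assumed commands):
#   mysql -e 'STOP ALL SLAVES'
#   mysql -e 'SHOW ALL SLAVES STATUS\G' | replication_stopped && mysqladmin shutdown
```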

@Cmjohnson, I'll be on vacation next week, and then at the analytics offsite the following week. Coordinate with @elukey if you want to do it next week, otherwise let's set a date after May 22nd.

@Ottomata: is there a better time this week or do you push it out to next week? Also, whatever we change this out with will probably not last long either. It appears the BBUs for this server class and age are failing very quickly. This is just a temporary fix; a replacement server is needed sooner rather than later.

@Cmjohnson we just need to alert people a couple of days in advance, nothing more. Do you have a preferred date/time?

Mentioned in SAL (#wikimedia-operations) [2017-06-01T17:02:19Z] <elukey> sto mysql, eventlogging_sync and shutdown db1047 (analytics-store) for maintenance - T159266

Was the BBU replaced yesterday in the end?

root@db1047:~# megacli -AdpBbuCmd  -a1

BBU status for Adapter: 1

BatteryType: BBU
Voltage: 3856 mV
Current: 0 mA
Temperature: 39 C
Battery State: Failed
BBU Firmware Status:

I have forced a relearn for it in case it wasn't forced after its replacement. I will report back in a while on whether it has started to charge.

root@db1047:~# megacli -AdpBbuCmd  -a1

BBU status for Adapter: 1

BatteryType: BBU
Voltage: 3857 mV
Current: 0 mA
Temperature: 39 C
Battery State: Degraded(Need Attention)
		A manual learn is required.
BBU Firmware Status:

  Charging Status              : None
  Voltage                                 : OK
  Temperature                             : OK
  Learn Cycle Requested	                  : Yes

I think this is not going to get any better - probably not worth spending more time on this host if it is going to be decommissioned at some point soon.

root@db1047:~# megacli -AdpBbuCmd  -a1

BBU status for Adapter: 1

BatteryType: BBU
Voltage: 4078 mV
Current: 59 mA
Temperature: 39 C
Battery State: Failed
BBU Firmware Status:

  Charging Status              : Charging
  Voltage                                 : OK
  Temperature                             : OK
  Learn Cycle Requested	                  : No
  Learn Cycle Active                      : No
  Learn Cycle Status                      : OK
  Learn Cycle Timeout                     : No
  I2c Errors Detected                     : No
  Battery Pack Missing                    : No
  Battery Replacement required            : Yes

 Charging Terminated     : Yes
  Over Charged            : Yes
Relative State of Charge: 100 %
Charger Status: In Progress
Remaining Capacity: 451 mAh
Full Charge Capacity: 451 mAh
isSOHGood: No
  Battery backup charge time : 0 hours

BBU Capacity Info for Adapter: 1

  Relative State of Charge: 100 %
  Absolute State of charge: 27 %
  Remaining Capacity: 451 mAh
  Full Charge Capacity: 451 mAh
  Run time to empty: Battery is not being charged.
  Average time to empty: Battery is not being charged.
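
The 100 % relative versus 27 % absolute figures look contradictory but are consistent if the relative figure is measured against the current full charge capacity and the absolute one against the design capacity. A small worked check, assuming the replacement pack has the same 1700 mAh design capacity as the one reported in the first dump (pct is a hypothetical helper):

```shell
# pct is a hypothetical helper: prints PART as a rounded percentage of WHOLE.
pct() {
    awk -v part="$1" -v whole="$2" 'BEGIN { printf "%.0f\n", 100 * part / whole }'
}

pct 18 494     # first dump: remaining vs. full charge capacity (relative)
pct 18 1700    # first dump: remaining vs. design capacity (absolute)
pct 451 1700   # this dump: absolute, assuming a 1700 mAh design capacity
pct 494 1700   # first dump: full charge capacity vs. design capacity
```

The first three calls reproduce the reported 4 %, 1 % and 27 %. The last one gives 29, meaning the old pack could only hold about a third of its design capacity, which fits the "Battery Replacement required" flag.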

@elukey @Ottomata I will leave it up to you whether you really think we should try to get another BBU...
I have requested another learning cycle to see if it helps in any way, but I doubt it.

BBU status for Adapter: 1

BatteryType: BBU
Voltage: 4079 mV
Current: 57 mA
Temperature: 39 C
Battery State: Degraded(Need Attention)
		A manual learn is required.
BBU Firmware Status:

  Charging Status              : Charging
  Voltage                                 : OK
  Temperature                             : OK
  Learn Cycle Requested	                  : Yes

I agree with Manuel's assessment. Maybe the only thing to discuss is whether to force WB (write-back) or accept that it will have bad performance at times. Maybe leave this open until db1047 is replaced.

I think it is already forced:

root@db1047:~# megacli -LDInfo -LAll -a1 | grep "Cache Policy:"
Default Cache Policy: WriteBack, ReadAdaptive, Direct, Write Cache OK if Bad BBU
Current Cache Policy: WriteBack, ReadAdaptive, Direct, Write Cache OK if Bad BBU
Default Cache Policy: WriteBack, ReadAdaptive, Direct, Write Cache OK if Bad BBU
Current Cache Policy: WriteBack, ReadAdaptive, Direct, Write Cache OK if Bad BBU

Cool thanks, so they should be aware of the dangers of that :-)

@elukey might have other opinions, but I'm inclined to try our best to expedite the ordering of new hardware, rather than worry about the BBU. If we lost db1047, analytics wouldn't lose any data, as it is in db1046, in HDFS, and also in Kafka.

+1, we already tried to replace the BBU and it didn't work, so I don't think it is worth spending more time on it.

Let's close this for now then, as nothing will be done at this point (and I agree with what you guys think: it is not worth it).