It can be TokuDB, pt-table-checksum, indexes, BBU, disks, ...
I have acked it for now.
It is probably the second BBU.
jcrespo | Feb 28 2017, 8:03 PM
F5951035: screenshot-grafana-admin.wikimedia.org-2017-03-01-08-27-25.png | Mar 1 2017, 7:31 AM
It caught up before I was able to check it. I saw it 13 minutes delayed but
when I got it to the terminal it was already up to date.
Did you see which thread was lagging? S1 or S2? I will investigate further
tomorrow.
I ran pt-table-checksum today from 8AM to 6PM on several s2 wikis, but
there was no lag at all, so I don't know if it could have caused it later, once it
got there. If it was, I guess it would still be lagging, as the runs went on for
several hours during the day.
I also checked the slow queries report for around those times (just a quick
glance) and everything looked related to INSERTs.
Quick look at the BBU reveals no issue there.
More to come tomorrow, server is fine now.
So, by looking at the binlogs I have seen that all activity related to pt-table-checksum finished at this time:
db1047-bin.005028 #170228 17:13:33
As I said yesterday, BBU looked good. Although I have seen this today:
Relative State of Charge: 97 %
And this is the number today:
Relative State of Charge: 95 %
Discharging : Yes
However:
Battery State: Optimal
The policy though, looks fine (this was the same yesterday too):
root@db1047:/srv/sqldata# megacli -LDInfo -LAll -aAll | grep "Cache Policy:"
Default Cache Policy: WriteBack, ReadAheadNone, Direct, No Write Cache if Bad BBU
Current Cache Policy: WriteBack, ReadAheadNone, Direct, No Write Cache if Bad BBU
Default Cache Policy: WriteBack, ReadAdaptive, Direct, Write Cache OK if Bad BBU
Current Cache Policy: WriteBack, ReadAdaptive, Direct, Write Cache OK if Bad BBU
Default Cache Policy: WriteBack, ReadAdaptive, Direct, Write Cache OK if Bad BBU
Current Cache Policy: WriteBack, ReadAdaptive, Direct, Write Cache OK if Bad BBU
And the RAID looks good:
root@db1047:/srv/sqldata# megacli -LDInfo -LAll -aAll | grep "State"
State : Optimal
State : Optimal
State : Optimal
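The cache policy check above can also be done mechanically. This is only a minimal Python sketch, assuming the text of `megacli -LDInfo -LAll -aAll` has already been captured into a string. The function name `parse_cache_policies`, the dictionary keys and the `SAMPLE_LDINFO` constant are hypothetical; the sample lines are copied from the output in this comment.

```python
# Sample copied from the megacli output quoted in this task (Current lines only).
SAMPLE_LDINFO = """\
Current Cache Policy: WriteBack, ReadAheadNone, Direct, No Write Cache if Bad BBU
Current Cache Policy: WriteBack, ReadAdaptive, Direct, Write Cache OK if Bad BBU
Current Cache Policy: WriteBack, ReadAdaptive, Direct, Write Cache OK if Bad BBU
"""


def parse_cache_policies(ldinfo_text):
    """Return one dict per 'Current Cache Policy' line, in output order."""
    policies = []
    for line in ldinfo_text.splitlines():
        line = line.strip()
        if not line.startswith("Current Cache Policy:"):
            continue
        # Everything after the first colon is a comma-separated list of settings.
        settings = [s.strip() for s in line.split(":", 1)[1].split(",")]
        policies.append({
            "write_mode": settings[0],
            # True when the controller keeps caching writes with a bad BBU.
            "write_cache_if_bad_bbu": settings[-1] == "Write Cache OK if Bad BBU",
        })
    return policies


if __name__ == "__main__":
    for index, policy in enumerate(parse_cache_policies(SAMPLE_LDINFO)):
        print(index, policy["write_mode"], policy["write_cache_if_bad_bbu"])
```

The index printed here is just the position of the line in the output, not necessarily the logical drive number reported by the controller.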
At around the time of the delay, there is a huge spike in inserts as can be seen in the image
Here we can see there're also slow queries running at around that time: https://tendril.wikimedia.org/report/slow_queries?host=%5Edb1047&user=&schema=&qmode=eq&query=&hours=24
I have been doing more pt-table-checksum runs this morning and I have stopped now as I am going for lunch. There are no pending transaction executions on db1047, so I would rule out pt-table-checksum as the cause for now.
BBU status for Adapter: 1
BatteryType: BBU
Voltage: 3788 mV
Current: 0 mA
Temperature: 39 C
Battery State: Failed
BBU Firmware Status:
Charging Status : None
Voltage : OK
Temperature : OK
Learn Cycle Requested : No
Learn Cycle Active : No
Learn Cycle Status : OK
Learn Cycle Timeout : No
I2c Errors Detected : No
Battery Pack Missing : No
Battery Replacement required : Yes
Remaining Capacity Low : Yes
Periodic Learn Required : No
Transparent Learn : No
No space to cache offload : No
Pack is about to fail & should be replaced : No
Cache Offload premium feature required : No
Module microcode update required : No
GasGuageStatus:
Fully Discharged : No
Fully Charged : No
Discharging : Yes
Initialized : Yes
Remaining Time Alarm : No
Discharge Terminated : No
Over Temperature : No
Charging Terminated : No
Over Charged : No
Relative State of Charge: 4 %
Charger Status: Unknown
Remaining Capacity: 18 mAh
Full Charge Capacity: 494 mAh
isSOHGood: No
Battery backup charge time : 0 hours
BBU Capacity Info for Adapter: 1
Relative State of Charge: 4 %
Absolute State of charge: 1 %
Remaining Capacity: 18 mAh
Full Charge Capacity: 494 mAh
Run time to empty: Battery is not being charged.
Average time to empty: Battery is not being charged.
Estimated Time to full recharge: Battery is not being charged.
Cycle Count: 19
Max Error = 0 %
Remaining Capacity Alarm = 170 mAh
Remining Time Alarm = 10 Min
BBU Design Info for Adapter: 1
Date of Manufacture: 12/04, 2011
Design Capacity: 1700 mAh
Design Voltage: 3700 mV
Specification Info: 33
Serial Number: 989
Pack Stat Configuration: 0x0004
Manufacture Name: SANYO
Firmware Version :
Device Name: DLGC9R0
Device Chemistry: LION
Battery FRU: N/A
Battery FRU: N/A
Transparent Learn = 0
App Data = 0
BBU Properties for Adapter: 1
Auto Learn Period: 90 Days
Next Learn time: Sat Apr 29 00:18:08 2017
Learn Delay Interval:0 Hours
Auto-Learn Mode: Enabled
Exit Code: 0x00
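For anyone who wants to pull the relevant fields out of a dump like this, here is a rough Python sketch. It assumes the output of `megacli -AdpBbuCmd -a1` has been saved as text with one field per line; the function names (`parse_bbu_fields`, `bbu_problems`, `capacity_ratio`) are hypothetical, and treating "Optimal" as the only healthy battery state is an assumption based on the value seen earlier in this task. The sample values are copied from the output above.

```python
# A few lines copied from the adapter 1 output quoted in this task.
SAMPLE_BBU_OUTPUT = """\
BBU status for Adapter: 1
Battery State: Failed
Battery Replacement required : Yes
Relative State of Charge: 4 %
Full Charge Capacity: 494 mAh
isSOHGood: No
Design Capacity: 1700 mAh
"""


def parse_bbu_fields(text):
    """Map 'Key: value' lines to a dict, keeping the first value seen per key."""
    fields = {}
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        key, value = key.strip(), value.strip()
        if sep and value and key not in fields:
            fields[key] = value
    return fields


def bbu_problems(fields):
    """List the signs of a bad battery found in the parsed fields."""
    problems = []
    # Assumption: anything other than "Optimal" is worth flagging.
    if fields.get("Battery State") != "Optimal":
        problems.append("battery state is " + fields.get("Battery State", "unknown"))
    if fields.get("Battery Replacement required") == "Yes":
        problems.append("controller requests battery replacement")
    if fields.get("isSOHGood") == "No":
        problems.append("state of health is not good")
    return problems


def capacity_ratio(fields):
    """Full charge capacity as a fraction of design capacity (both in mAh)."""
    full = int(fields["Full Charge Capacity"].split()[0])
    design = int(fields["Design Capacity"].split()[0])
    return full / design


if __name__ == "__main__":
    parsed = parse_bbu_fields(SAMPLE_BBU_OUTPUT)
    for problem in bbu_problems(parsed):
        print(problem)
    print(f"full charge capacity vs design: {capacity_ratio(parsed):.0%}")
```

Keeping the first value per key matters because some names (for example "Voltage") appear twice in the real output, once as a measurement and once as a status flag.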
We need to change the battery of the second controller (number 1) and disable auto-learning there (it was only disabled on number 0).
For the first part we need @Cmjohnson, next week.
I have disabled the auto-learn mode for that controller - I have not set it to "2" (warn via an event) because we are not really using it:
root@db1047:~# echo "autoLearnMode=1" > disable_learn
root@db1047:~# megacli -AdpBbuCmd -SetBbuProperties -f disable_learn -a1
Adapter 1: Set BBU Properties Succeeded.
Exit Code: 0x00
root@db1047:~# megacli -AdpBbuCmd -a1 | grep Auto-Learn
Auto-Learn Mode: Disabled
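Since auto-learn had been disabled on only one of the two controllers, the same step could be prepared for every adapter at once. The following Python sketch only builds the properties file body and the command lines; it does not run anything. The function name `autolearn_commands` is hypothetical, and the flags are exactly the ones used in the commands above.

```python
def autolearn_commands(adapters, mode=1, properties_file="disable_learn"):
    """Build the properties file body and one megacli command per adapter.

    mode=1 disables auto-learn and mode=2 warns via an event, as described
    in this task. Nothing is executed here.
    """
    file_body = f"autoLearnMode={mode}\n"
    commands = [
        ["megacli", "-AdpBbuCmd", "-SetBbuProperties",
         "-f", properties_file, f"-a{adapter}"]
        for adapter in adapters
    ]
    return file_body, commands


if __name__ == "__main__":
    body, commands = autolearn_commands([0, 1])
    print(body, end="")
    for command in commands:
        print(" ".join(command))
```

Printing the commands first and running them by hand keeps a human in the loop, which seems sensible for a change made directly on a RAID controller.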
@Cmjohnson once you are back in the DC can you check if you have any spare BBU?
Thanks!
@Marostegui There are a few decom DBs now; I could swap out the BBU if you like, or just proceed with the decom process. Let me know your preference.
Hey @Cmjohnson! Let's wait for now and see if that ticket keeps progressing; if the server is going to get decommissioned, it would be just a waste of time to swap its BBU now.
I will keep you posted!
Thanks!
@Marostegui precisely on that ticket they are discussing when they will be able to decom it, and by the looks of it, it is not going to happen for months.
As much as I would love to take away this machine, I would be realistic (or pessimistic! O:-) ) and change it (of course, with no urgency). I thought the batteries were soldered onto the board or something, but based on how long it took Chris to change it, it must not be that complex (I would let Chris confirm that).
The alternative is getting an alert during the weekend every week.
Given the responses so far, I think we will be able to decom it soon. But, we should wait a while (maybe a week) to collect more feedback to be sure. In either case, we are going to have to refresh that hardware soon, right? If we have no objections within a week, I think we can just decom it asap.
Let's wait then to see how that ticket progresses over the next week or so, in order not to make Chris replace it and then decom the server a few days later :)
Thanks for the fast answer!
@Marostegui Same thing as db1048? I can use a spare BBU from a decom server if you like, or is this server nearing its last days?
@Cmjohnson it was supposed to be killed soon, but @Ottomata believes it will take a bit longer, so maybe it is worth replacing the BBU.
@Ottomata how difficult would it be to arrange a day for this host to be down for a few minutes so Chris can replace its faulty BBU?
Not difficult at all. I think this server is not used often, only really when there are issues with dbstore1002. @Cmjohnson, let me know what day is good for you and I'll send an email out.
This Wednesday is the failover, do you really want to do it then? We may need Chris or me to put things down, and we may be unavailable.
As long as we give a couple of days heads up, I think we're fine. Pick a day any day :) Just let me know with enough time to get an email out.
@Ottomata remember to run "stop all slaves;" before shutting down MySQL (not a hard requirement, but just in case there is a transaction hanging, better to be careful than to have to reclone that host!)
@Cmjohnson, I'll be on vacation next week, and then at the analytics offsite the following. Coordinate with @elukey if you want to do it next week, otherwise let's set a date after May 22nd.
@Ottomata: is there a better time this week, or do you want to push it out to next week? Also, whatever we change this out with will probably not last long either. It appears the BBUs for this server class and age are going very quickly. This is just a temporary fix; a replacement server is needed sooner rather than later.
@Cmjohnson we just need to alert people a couple of days in advance, nothing more. Do you have a preferred date/time?
Mentioned in SAL (#wikimedia-operations) [2017-06-01T17:02:19Z] <elukey> sto mysql, eventlogging_sync and shutdown db1047 (analytics-store) for maintenance - T159266
Was the BBU replaced yesterday in the end?
root@db1047:~# megacli -AdpBbuCmd -a1
BBU status for Adapter: 1
BatteryType: BBU
Voltage: 3856 mV
Current: 0 mA
Temperature: 39 C
Battery State: Failed
BBU Firmware Status:
I have forced a relearn for it in case it wasn't forced after its replacement. I will report back in a while to see if it started to charge
root@db1047:~# megacli -AdpBbuCmd -a1
BBU status for Adapter: 1
BatteryType: BBU
Voltage: 3857 mV
Current: 0 mA
Temperature: 39 C
Battery State: Degraded(Need Attention) A manual learn is required.
BBU Firmware Status:
Charging Status : None
Voltage : OK
Temperature : OK
Learn Cycle Requested : Yes
I think this is not going to get any better - probably not worth spending more time on this host if it is going to be decommissioned at some point soon.
root@db1047:~# megacli -AdpBbuCmd -a1
BBU status for Adapter: 1
BatteryType: BBU
Voltage: 4078 mV
Current: 59 mA
Temperature: 39 C
Battery State: Failed
BBU Firmware Status:
Charging Status : Charging
Voltage : OK
Temperature : OK
Learn Cycle Requested : No
Learn Cycle Active : No
Learn Cycle Status : OK
Learn Cycle Timeout : No
I2c Errors Detected : No
Battery Pack Missing : No
Battery Replacement required : Yes
Charging Terminated : Yes
Over Charged : Yes
Relative State of Charge: 100 %
Charger Status: In Progress
Remaining Capacity: 451 mAh
Full Charge Capacity: 451 mAh
isSOHGood: No
Battery backup charge time : 0 hours
BBU Capacity Info for Adapter: 1
Relative State of Charge: 100 %
Absolute State of charge: 27 %
Remaining Capacity: 451 mAh
Full Charge Capacity: 451 mAh
Run time to empty: Battery is not being charged.
Average time to empty: Battery is not being charged.
@elukey @Ottomata I will leave this up to you if you really think we should try to get another BBU...
I have requested another learning cycle to see if it helps in any way, but I doubt it.
BBU status for Adapter: 1
BatteryType: BBU
Voltage: 4079 mV
Current: 57 mA
Temperature: 39 C
Battery State: Degraded(Need Attention) A manual learn is required.
BBU Firmware Status:
Charging Status : Charging
Voltage : OK
Temperature : OK
Learn Cycle Requested : Yes
I agree with Manuel's assessment. Maybe the only thing to discuss is whether to force WB (write-back), or accept that it will have bad performance at times. Maybe we leave this open until db1047 is replaced.
I think it is already forced:
root@db1047:~# megacli -LDInfo -LAll -a1 | grep "Cache Policy:"
Default Cache Policy: WriteBack, ReadAdaptive, Direct, Write Cache OK if Bad BBU
Current Cache Policy: WriteBack, ReadAdaptive, Direct, Write Cache OK if Bad BBU
Default Cache Policy: WriteBack, ReadAdaptive, Direct, Write Cache OK if Bad BBU
Current Cache Policy: WriteBack, ReadAdaptive, Direct, Write Cache OK if Bad BBU
@elukey might have other opinions, but I'm inclined to try our best to expedite the ordering of new hardware, rather than worry about the BBU. If we lost db1047, analytics wouldn't lose any data, as it is in db1046, in HDFS, and also in Kafka.
+1, we already tried to replace the BBU and it didn't work, so I don't think it is worth spending more time on it.
Let's close this then for now, as nothing will be done at this point (and I agree with what you guys think - not worth it).