We received reports of high replag (~3-4 hours) on wikireplica-analytics (https://tools.wmflabs.org/replag/) over the holidays. I had pinged @Marostegui about it earlier; he said it was due to heavy queries on the server since the DNS switchover to the new servers, and suggested re-instituting the query killer to improve this. The Cloud Services team discussed it in our weekly meeting today and we are in favor. Let us know what we can do to help make this happen, thank you!
Description
Details
Status | Subtype | Assigned | Task
---|---|---|---
Resolved | | jcrespo | T140788 Labs databases rearchitecture (tracking)
Resolved | | bd808 | T166402 Program 7 Outcome 3: data services
Resolved | | bd808 | T142807 Migrate all users to new Wiki Replica cluster and decommission old hardware
Resolved | | Banyek | T183983 Re-institute query killer for the analytics WikiReplica
Resolved | | Banyek | T203674 Debian package or files managed by puppet for pt-kill-wmf
Event Timeline
Hi,
Yeah, we have been seeing long delays on the analytics replicas for some time already.
Some days before Christmas, Jaime and I were discussing whether we should enable the query killer again to avoid those super long connections and prevent delays. I believe we were almost decided on it; we just needed to talk to you guys first.
There are still some questions that need to be answered:
- How long should the query killer allow connections to run for on the analytics replicas? 6 hours?
- If we set up a query killer on the analytics replica, we must set it too on the web replica. That one needs to be a lot more aggressive. What would be a good threshold for that one? 5 minutes?
Thoughts?
The thing is, there was already a query killer in place during Christmas (killing analytics queries running longer than 28800 seconds and web queries longer than 3600), but they did not kill many queries (~10 each in 2 weeks).
I recall killing some long-running queries on labsdb1010 during Christmas. I cannot recall whether they had been running for longer than 28800 seconds or not, though, so we might need to check that the killers are indeed working as expected.
@madhuvishy we need input from you guys on what a reasonable query killer threshold would be for both the web and analytics hosts.
We have also been discussing another track to explore: setting up active-active on the labs hosts, meaning that the host currently on standby could also be active and serve analytics traffic.
Even if we go for that, we still need the query killer with some more aggressive settings given that 8 hours doesn't really kill many, so maybe it needs to be decreased a bit.
What were the old query killer thresholds? We could start with those and add 50% to see how it goes.
> What were the old query killer thresholds?
It depended on the load; at times it was as low as 1 hour (under high load), at other times it was 4 hours. The whole point of separating analytics vs. web is that people who need up-to-date results use the web endpoint, and people who need long-running queries use the analytics one (which can have lag).
Yep, understood. The way I'm interpreting it is that there is no objective right answer here. At first blush I would be inclined to take cues from that: web query killer at 1h and analytics at 4h, and we see how it goes.
edit: 1h even seems high for the web, but I guess we'll find out if it's actually a problem?
> Web query killer at 1h and analytics at 4h and we see how it goes.
I've implemented that, unpuppetized, waiting for your feedback. Once you're happy, I will puppetize it and add a couple of hiera keys that you can use to modify the values as you wish.
I have not seen any more delays on the wiki replicas since this was set up, so those thresholds are looking pretty good!
We are seeing that some queries on Execute state are not being killed.
Playing around with the current way of running pt-kill:
pt-kill --print --kill --victims all --interval 10 --busy-time 14400 --match-command Query --match-user '^[spu][0-9]' F=/dev/null -S /run/mysqld/mysqld.sock
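For context (this sketch is not part of the original ticket): the --match-user value is a regex, and '^[spu][0-9]' restricts the killer to Wiki Replica client accounts such as the s51772 tool account that shows up later in this thread, while admin and replication sessions fall outside the pattern. A minimal, purely illustrative Python rendering of that matching:

```python
import re

# pt-kill's --match-user takes a regex; '^[spu][0-9]' limits the killer
# to Wiki Replica client accounts (names like s51772), leaving system
# and replication users alone.
USER_RE = re.compile(r'^[spu][0-9]')

def is_candidate(user: str) -> bool:
    """Return True if this connection's user would be considered for killing."""
    return bool(USER_RE.match(user))

print(is_candidate('s51772'))       # True: tool account seen in this ticket
print(is_candidate('root'))         # False: admin sessions never match
print(is_candidate('system user'))  # False: replication threads never match
```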
We realised that using either --match-command 'Query|Execute' or even --match-command Execute wasn't working, because all queries in Execute state, even brand-new ones, were being killed.
It looked like --busy-time was being ignored, and it is indeed a bug: https://jira.percona.com/browse/PT-167
The bug report states that it was fixed in 3.0.5, but 3.0.6 still doesn't work.
I have applied the patch the user suggested and it looks like it works fine. I have left it running only on labsdb1009 as a test, to make sure nothing else gets killed, so we'll see how it behaves over the next few days.
For the record this is what is running:
This is the original run:
pt-kill --print --kill --victims all --interval 10 --busy-time 14400 --match-command Query --match-user '^[spu][0-9]' F=/dev/null -S /run/mysqld/mysqld.sock
And this is what is running now on labsdb1009:
/home/marostegui/pt-kill-patched --print --kill --victims all --interval 10 --busy-time 14400 --match-command 'Query|Execute' --match-user '^[spu][0-9]' F=/dev/null -S /run/mysqld/mysqld.sock
And this is the patch:
root@labsdb1009:~# diff -u /usr/bin/pt-kill /home/marostegui/pt-kill-patched
--- /usr/bin/pt-kill	2017-01-02 13:51:07.000000000 +0000
+++ /home/marostegui/pt-kill-patched	2018-02-19 16:34:58.412981273 +0000
@@ -3512,7 +3512,7 @@
          next QUERY;
       }
-      if ( $find_spec{busy_time} && ($query->{Command} || '') eq 'Query' ) {
+      if ( $find_spec{busy_time} && ($query->{Command} || '') ne 'Sleep' ) {
          next QUERY unless defined($query->{Time});
          if ( $query->{Time} < $find_spec{busy_time} ) {
             PTDEBUG && _d("Query isn't running long enough");
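To make the patch's effect concrete, here is a small Python sketch (not the Perl code itself; the function name and dict shape are illustrative) of the relevant branch of pt-kill's find() loop. Upstream, the --busy-time guard only fired when Command was exactly 'Query', so an Execute-state query skipped the age check entirely and was killed at once; the patch applies the guard to anything that is not an idle Sleep thread:

```python
def matches(query: dict, busy_time: int, patched: bool) -> bool:
    """Return True when a query would be killed. Assumes --match-command
    has already accepted this query's Command value."""
    # upstream: the busy-time guard only fired for Command == 'Query'
    # patched:  it fires for anything that is not an idle 'Sleep' thread
    if patched:
        guard_applies = query['Command'] != 'Sleep'
    else:
        guard_applies = query['Command'] == 'Query'
    if guard_applies and query['Time'] < busy_time:
        return False   # not running long enough yet; spare it
    return True

fresh_execute = {'Command': 'Execute', 'Time': 0}
# Unpatched: a 0-second Execute bypasses the guard and is killed at once.
print(matches(fresh_execute, 14400, patched=False))  # True (the bug)
# Patched: the same query survives until it exceeds --busy-time.
print(matches(fresh_execute, 14400, patched=True))   # False
```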
Reporting status:
So far, only queries in Query state running longer than 14400 seconds have been killed overnight (two queries), so, so far so good.
labsdb1009 has killed a couple of legit queries, so I have applied the same patch to labsdb1010 and labsdb1011 and will leave it running to get a larger sample of queries and confirm it is indeed working fine (as it looks so far).
Apparently this will be really fixed (hopefully really really fixed this time) on 3.0.7: https://jira.percona.com/browse/PT-167?focusedCommentId=223276&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-223276
This is still happening, at least on labsdb1009:
| 118438410 | s51772 | 10.64.37.14:50886 | enwiki_p | Execute | 264724 | User sleep | SELEC INNER JOIN redirect Ruc ON rd_from = page_id
Yes, but I have not yet done a 100% deploy (didn't have time to babysit it), only on s8 and certain s1 hosts.
Oh, sorry, I mixed up my tickets. I was just editing the related T149421. This seems to work, but I have not touched it in a long time.
Yeah, me neither. Since we left it running (with the new pt-kill version) I haven't seen the original issue come back.
pt-kill has been released (https://www.percona.com/doc/percona-toolkit/3.0/release_notes.html#v3-0-8-released-2018-03-13) and it has:
Bug Fixes: PT-1492: pt-kill in version 3.0.7 ignores the value of the --busy-time
I am testing that pt-kill version on labsdb1009 and I just saw this:
# 2018-04-20T06:20:09 KILL 8093697 (Execute 0 sec) /* related.suggest_redlinks LIMIT:15 NM */
I cannot really believe this is not solved yet after _another_ release. I am going to monitor it and if it keeps killing stuff after 0 seconds, I will report it again.
So looks like it is not yet fixed:
# 2018-04-20T06:20:09 KILL 8093697 (Execute 0 sec) /* related.suggest_redlinks LIMIT:15 NM */
I am going back to the previous patched version and will report this.
This is still not solved. I have asked upstream to see if there is any update on when they expect to release a fix.
Change 458810 had a related patch set uploaded (by Banyek; owner: Banyek):
[operations/puppet@production] Labs: Config template generation for pt-kill
Mentioned in SAL (#wikimedia-operations) [2018-10-01T07:54:51Z] <banyek> disabling puppet on labsdb1009, labsdb1010, labsdb1011 (T183983)
Change 458810 merged by Banyek:
[operations/puppet@production] wikireplicas: Config template generation for wmf-pt-kill
Change 463712 had a related patch set uploaded (by Banyek; owner: Banyek):
[operations/puppet@production] wikireplicas: quickfix for wmf-pt-kill template config
Change 463712 merged by Banyek:
[operations/puppet@production] wikireplicas: quickfix for wmf-pt-kill template config
Change 463734 had a related patch set uploaded (by Banyek; owner: Banyek):
[operations/puppet@production] wikireplicas: ensure wmf-pt-kill service is stopped
Change 463734 merged by Banyek:
[operations/puppet@production] wikireplicas: ensure wmf-pt-kill service is stopped
Mentioned in SAL (#wikimedia-operations) [2018-10-01T12:15:05Z] <banyek> enabling puppet on labsdb1009, labsdb1010, labsdb1011 (T183983)
Upstream updated the ticket saying it is fixed in 3.0.12, which was released a few days ago.
I tested it and it is indeed not fixed there (and it is not on the release notes): https://jira.percona.com/browse/PT-1492?focusedCommentId=230043&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-230043
I asked upstream and they have now changed to say it will be fixed on 3.0.13: https://jira.percona.com/browse/PT-1492?page=com.atlassian.jira.plugin.system.issuetabpanels%3Achangehistory-tabpanel
I am not sure if we want to close this ticket.
I mean, the original problem is now solved, but it would be nice to keep track of when the upstream pt-kill bug will be fixed.
I am subscribed to the upstream bug (you can do that too) and I would close this ticket, as the upstream bug really has nothing to do with this deployment.