
Re-institute query killer for the analytics WikiReplica
Closed, ResolvedPublic

Description

We received reports of high replag (~3-4 hours) on wikireplica-analytics (https://tools.wmflabs.org/replag/) over the holidays. I had poked @Marostegui about this before, and he mentioned that it was due to heavy queries on the server since the DNS switchover to the new servers, and suggested re-instituting the query killer to improve this. The Cloud Services team discussed it in our weekly meeting today and we are for it; let us know what we can do to help make this happen, thank you!

Event Timeline

madhuvishy created this task.
Marostegui moved this task from Triage to Blocked external/Not db team on the DBA board.

Hi,

Yeah, we have been seeing long delays on the analytics replica for some time already.
Some days before Christmas, Jaime and I discussed whether we should enable the query killer again to avoid those super long connections and prevent delays. I believe we were almost decided on it; we just needed to talk to you guys.
There are still some questions that need to be answered:

  1. How long should the query killer allow queries to run for on the analytics replica? 6 hours?
  2. If we set up a query killer on the analytics replica, we must set one up on the web replica too. That one needs to be a lot more aggressive. What would be a good threshold for that one? 5 minutes?
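
For illustration only, those thresholds would map onto pt-kill's --busy-time (in seconds) roughly like this; this is a sketch using the 6 hour / 5 minute values from the questions above, not the actual commands in use:

# Hypothetical analytics replica killer: allow queries to run up to 6 hours (21600 s)
pt-kill --print --kill --victims all --interval 10 --busy-time 21600 --match-user '^[spu][0-9]' -S /run/mysqld/mysqld.sock

# Hypothetical web replica killer: much more aggressive, 5 minutes (300 s)
pt-kill --print --kill --victims all --interval 10 --busy-time 300 --match-user '^[spu][0-9]' -S /run/mysqld/mysqld.sock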

Thoughts?

The thing is, there was already a query killer running during Christmas (analytics set to kill queries busy for over 28800 seconds, web over 3600), but they did not kill many queries (~10 each in 2 weeks).

The thing is, there was already a query killer running during Christmas (analytics set to kill queries busy for over 28800 seconds, web over 3600), but they did not kill many queries (~10 each in 2 weeks).

I recall killing some long-running queries on labsdb1010 during Christmas; I cannot recall whether they had been running for longer than 28800 seconds, though, so we might need to check that the killers are indeed working as expected.

@madhuvishy we need input from you guys about what a reasonable query killer threshold would be for both the web and analytics hosts.

We have also been discussing another track to explore: setting up active-active on the labs hosts, meaning that the host that is currently on standby could also be active and serve analytics.
Even if we go for that, we still need the query killer with more aggressive settings, given that 8 hours doesn't really kill many queries, so maybe it needs to be decreased a bit.

What were the old query killer thresholds? We could start with those, add 50%, and see how it goes.

What were the old query killer thresholds?

It depended on the load: at times it was as low as 1 hour (when under high load), at other times it was 4 hours. The whole point of separating analytics vs. web is so that people who need up-to-date results use the web endpoint, and people who need long-running queries use the analytics one (which could have lag).

Yep, understood. The way I'm interpreting it, there is no objective right answer here. At first blush I would be inclined to take cues from that: web query killer at 1h and analytics at 4h and we see how it goes.

edit: 1h even seems high for the web, but I guess we'll find out if it's actually a problem?

Web query killer at 1h and analytics at 4h and we see how it goes.

I've implemented that, unpuppetized, waiting for your feedback. Once you're happy, I will puppetize it and set a couple of Hiera keys that you can use to modify the values as you wish.

@jcrespo Sounds great! Let's puppetize and tweak later if needed. Thank you :)

@jcrespo Sounds great! Let's puppetize and tweak later if needed. Thank you :)

+1

I have not seen any more delays on the wiki replicas since this was set up, so those thresholds are looking pretty good!

I've at times seen 30 seconds of lag, but I guess that is ok?

I would consider that normal and within margins, yeah

We are seeing that some queries in Execute state are not being killed.
Playing around with the current way of running pt-kill:

pt-kill --print --kill --victims all --interval 10 --busy-time 14400 --match-command Query --match-user '^[spu][0-9]' F=/dev/null -S /run/mysqld/mysqld.sock

We realised that using either --match-command 'Query|Execute' or even --match-command Execute wasn't working, because all queries in Execute state, even brand new ones, were being killed.
It looked like --busy-time was being ignored, and it is indeed a bug: https://jira.percona.com/browse/PT-167
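
A quick way to spot the sessions that survive is to look at the processlist directly; a rough sketch, reusing the socket path and user pattern from the command above:

# List replica-user sessions in Execute state that have been busy longer than the 14400 s threshold
mysql -S /run/mysqld/mysqld.sock -e "SELECT ID, USER, TIME, LEFT(INFO, 80) AS query_snippet FROM information_schema.PROCESSLIST WHERE COMMAND = 'Execute' AND USER REGEXP '^[spu][0-9]' AND TIME > 14400 ORDER BY TIME DESC"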

The bug report states that it was fixed in 3.0.5, but 3.0.6 still doesn't work.
I have applied the patch a user suggested there and it looks like it works fine. I have left it running only on labsdb1009 as a test, to make sure nothing else gets killed, so we'll see how it behaves over the next few days.
For the record, this is what is running:

This is the original run:

pt-kill --print --kill --victims all --interval 10 --busy-time 14400 --match-command Query --match-user '^[spu][0-9]' F=/dev/null -S /run/mysqld/mysqld.sock

And this is what is running now on labsdb1009:

/home/marostegui/pt-kill-patched  --print --kill --victims all --interval 10 --busy-time 14400 --match-command 'Query|Execute' --match-user '^[spu][0-9]' F=/dev/null -S /run/mysqld/mysqld.sock

And this is the patch:

root@labsdb1009:~# diff -u /usr/bin/pt-kill /home/marostegui/pt-kill-patched
--- /usr/bin/pt-kill	2017-01-02 13:51:07.000000000 +0000
+++ /home/marostegui/pt-kill-patched	2018-02-19 16:34:58.412981273 +0000
@@ -3512,7 +3512,7 @@
          next QUERY;
       }

-      if ( $find_spec{busy_time} && ($query->{Command} || '') eq 'Query' ) {
+      if ( $find_spec{busy_time} && ($query->{Command} || '') ne 'Sleep' ) {
          next QUERY unless defined($query->{Time});
          if ( $query->{Time} < $find_spec{busy_time} ) {
             PTDEBUG && _d("Query isn't running long enough");
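
In other words, the patched check applies --busy-time to any command other than Sleep, so sessions in Execute state (prepared statements) are covered too. As a sanity check, the patched binary can first be run with --print only (no --kill), so it just reports what it would kill; a sketch:

# Dry run: print the KILL statements without executing them
/home/marostegui/pt-kill-patched --print --victims all --interval 10 --busy-time 14400 --match-command 'Query|Execute' --match-user '^[spu][0-9]' F=/dev/null -S /run/mysqld/mysqld.sock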

Reporting status:

So far only queries in Query state running longer than 14400 seconds have been killed overnight (two queries), so, so far so good.

labsdb1009 has killed a couple of legit queries, so I have applied the same patch to labsdb1010 and labsdb1011 and will leave it running to get a larger sample of queries and see if it is indeed working fine (as it looks so far).

The bug report states that it was fixed in 3.0.5, but 3.0.6 still doesn't work.

Apparently this will be really fixed (hopefully really, really fixed this time) in 3.0.7: https://jira.percona.com/browse/PT-167?focusedCommentId=223276&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-223276

This is still happening, at least on labsdb1009:

| 118438410 | s51772          | 10.64.37.14:50886 | enwiki_p           | Execute |  264724 | User sleep                       | SELEC
INNER JOIN redirect Ruc ON rd_from = page_id
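
If the killer is not doing its job, a runaway session like that one can still be terminated by hand; a sketch using the thread id from the processlist output above:

# Manually kill the long-running Execute thread (id taken from the processlist above)
mysql -S /run/mysqld/mysqld.sock -e "KILL 118438410"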

Apparently this was not because of the bug; the query killer was disabled.

So this has been working fine for the last few weeks then, right?

Yes, but I have not yet done a 100% deploy (didn't have time to babysit it), only on s8 and certain s1 hosts.

Yes, but I have not yet done a 100% deploy (didn't have time to babysit it), only on s8 and certain s1 hosts.

I'm confused, this ticket is only for labs hosts, isn't it?

Oh, sorry, I mixed up my tickets. I was just editing the related T149421. This seems to work, but I have not touched it in a long time.

Oh, sorry, I mixed up my tickets. I was just editing the related T149421. This seems to work, but I have not touched it in a long time.

Yeah, me neither. Since we left it running (with the new pt-kill version) I haven't seen the original issue come back.

pt-kill 3.0.8 has been released (https://www.percona.com/doc/percona-toolkit/3.0/release_notes.html#v3-0-8-released-2018-03-13) and its release notes include:

Bug Fixes

PT-1492: pt-kill in version 3.0.7 ignores the value of the --busy-time

I am testing that pt-kill version on labsdb1009 and I just saw this:

# 2018-04-20T06:20:09 KILL 8093697 (Execute 0 sec) /* related.suggest_redlinks LIMIT:15 NM */

I cannot really believe this is not solved yet after _another_ release. I am going to monitor it, and if it keeps killing queries after 0 seconds, I will report it again.
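
One rough way to monitor for that regression, assuming the --print output is captured in a log file (the path here is hypothetical), is to flag any kill of a query that had been busy for under 10 seconds:

# Flag kills of queries that were busy for less than 10 seconds (log path is hypothetical)
grep -E 'KILL [0-9]+ \((Query|Execute) [0-9] sec\)' /var/log/pt-kill.log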

So it looks like it is not yet fixed:

# 2018-04-20T06:20:09 KILL 8093697 (Execute 0 sec) /* related.suggest_redlinks LIMIT:15 NM */

I am going back to the previous patched version and will report this.

This is still not solved. I have asked upstream to see if there is any update on when they expect to release a fix.

Change 458810 had a related patch set uploaded (by Banyek; owner: Banyek):
[operations/puppet@production] Labs: Config template generation for pt-kill

https://gerrit.wikimedia.org/r/458810

Mentioned in SAL (#wikimedia-operations) [2018-10-01T07:54:51Z] <banyek> disabling puppet on labsdb1009, labsdb1010, labsdb1011 (T183983)

Change 458810 merged by Banyek:
[operations/puppet@production] wikireplicas: Config template generation for wmf-pt-kill

https://gerrit.wikimedia.org/r/458810

Change 463712 had a related patch set uploaded (by Banyek; owner: Banyek):
[operations/puppet@production] wikireplicas: quickfix for wmf-pt-kill template config

https://gerrit.wikimedia.org/r/463712

Change 463712 merged by Banyek:
[operations/puppet@production] wikireplicas: quickfix for wmf-pt-kill template config

https://gerrit.wikimedia.org/r/463712

Change 463734 had a related patch set uploaded (by Banyek; owner: Banyek):
[operations/puppet@production] wikireplicas: ensure wmf-pt-kill service is stopped

https://gerrit.wikimedia.org/r/463734

Change 463734 merged by Banyek:
[operations/puppet@production] wikireplicas: ensure wmf-pt-kill service is stopped

https://gerrit.wikimedia.org/r/463734

Mentioned in SAL (#wikimedia-operations) [2018-10-01T12:15:05Z] <banyek> enabling puppet on labsdb1009, labsdb1010, labsdb1011 (T183983)

Upstream updated the ticket to say the fix is in 3.0.12, which was released a few days ago.
I tested it and it is indeed not fixed there (and it is not mentioned in the release notes): https://jira.percona.com/browse/PT-1492?focusedCommentId=230043&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-230043
I asked upstream and they have now changed it to say it will be fixed in 3.0.13: https://jira.percona.com/browse/PT-1492?page=com.atlassian.jira.plugin.system.issuetabpanels%3Achangehistory-tabpanel

I am not sure if we want to close this ticket.
I mean, the original problem is now solved, but it would be nice to keep track of when the upstream pt-kill will be fixed.

I am not sure if we want to close this ticket.
I mean, the original problem is now solved, but it would be nice to keep track of when the upstream pt-kill will be fixed.

I am subscribed to the upstream bug (you can do that too) and I would close this ticket, as the upstream bug doesn't really have anything to do with this deployment.