Investigate why query killer didn't kill 1-hour long queries
Closed, Resolved · Public

Description

Needs more info on the description from the parent task.

Event Timeline

jcrespo triaged this task as High priority. · Feb 28 2018, 1:22 PM
jcrespo created this task.
Restricted Application added a subscriber: Aklapper. · Feb 28 2018, 1:22 PM
jcrespo moved this task from Triage to Next on the DBA board. · Feb 28 2018, 1:22 PM
jcrespo renamed this task from Investigate why query killer didn't kill 1-hour log queries to Investigate why query killer didn't kill 1-hour long queries. · Mar 1 2018, 10:24 AM

Is there any progress so far?
Is someone actively working on this?

No, we are not working on this yet. We have lots of fires going on at the moment, and this is one of them; we will try to get to it as soon as we can.

There is a strange gap in killing activity of any kind between November and March:

| 171966558 | 2017-11-20 06:29:35 | wmf_slave_wikiuser_sleep | kill 1761131953 |
| 171978924 | 2018-03-01 16:41:05 | wmf_slave_wikiuser_sleep | kill 1633679517 |

Even if that gap was the cause, the new query killer did not stop the long-running queries either; those had to be killed separately.

Change 415888 had a related patch set uploaded (by Jcrespo; owner: Jcrespo):
[operations/software@master] Consider as busy all queries that are not in Sleep state

https://gerrit.wikimedia.org/r/415888
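
The patch title suggests the rule "anything not in the Sleep state counts as busy". Below is a minimal Python sketch of that rule, not the code from operations/software: the `Process` class, `is_busy`, `kill_statements`, and the 60-second threshold (inferred from the `(>60)` label in the kill log) are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import List, Optional

# Assumed threshold, inferred from the "(>60)" label in the kill log.
SLOW_SECONDS = 60


@dataclass
class Process:
    """One processlist row, reduced to the fields used here (hypothetical model)."""
    id: int
    user: str
    command: str          # e.g. "Sleep", "Query", "Execute"
    time: int             # seconds spent in the current state
    info: Optional[str]   # statement text, if any


def is_busy(p: Process) -> bool:
    # Busy means "not sleeping", rather than matching one specific command.
    return p.command != "Sleep"


def kill_statements(processes: List[Process],
                    user: str = "wikiuser",
                    limit: int = SLOW_SECONDS) -> List[str]:
    """Return KILL statements for busy connections of `user` older than `limit` seconds."""
    return [
        f"KILL {p.id}"
        for p in processes
        if p.user == user and is_busy(p) and p.time > limit
    ]
```

Under this rule an idle connection is never reported as a slow query, while a connection in any other command state is, provided it belongs to the given user and has exceeded the limit.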

jcrespo claimed this task. · Edited · Mar 2 2018, 5:27 PM
jcrespo moved this task from Next to In progress on the DBA board.

Running SELECT sleep(70); as wikiuser to check that it is at least partly working.

It also shows up in the logs:
| 171978924 | 2018-03-02 17:29:03 | wmf_slave_wikiuser_slow (>60) | kill 1731917342; SELECT sleep(100)

The same query was killed:

MariaDB [wikidatawiki]> SELECT /* Wikibase\Lib\Store\Sql\WikiPageEntityMetaDataLookup::selectRevisionInformationMultiple */ rev_id, rev_content_format, rev_timestamp, page_latest, page_is_redirect, old_id, old_text, old_flags, page_title FROM `page` INNER JOIN `revision` ON ((page_latest=rev_id)) INNER JOIN `text` ON ((old_id=rev_text_id));
ERROR 2013 (HY000): Lost connection to MySQL server during query

| 171978924 | 2018-03-02 17:38:03 | wmf_slave_wikiuser_slow (>60) | kill 1732579668; SELECT  rev_id, rev_content_format, rev_tim

The working theory is that, for some reason, the query killer was disabled, had crashed, or was otherwise not working on that specific host.

I will check that the query killer is updated to the latest version and active on all production hosts, then consider this resolved.

Change 415888 merged by jenkins-bot:
[operations/software@master] Consider as busy all queries that are not in Sleep state

https://gerrit.wikimedia.org/r/415888

Mentioned in SAL (#wikimedia-operations) [2018-03-06T15:54:50Z] <jynus> deploying new query killer logic to all wikidata (s8) db replicas T188505

jcrespo closed this task as Resolved. · Mar 6 2018, 4:34 PM

@Lucas_Werkmeister_WMDE I am going to resolve this ticket once it has been deployed to all of s8 (wikidata database section). I will deploy on the other sections more slowly. This is not a risk-free deploy, so please be vigilant if there is something weird happening regarding queries failing or similar issues. This will, however, unblock at least the deployments you wanted to do.

@jcrespo okay, thanks a lot for your help! So far everything seems okay to me (no crazy server load spikes that I can see, and no drop in the rate of Wikidata edits).