Investigate why query killer didn't kill 1-hour long queries
Closed, Resolved (Public)

Assigned To

Authored By

	jcrespo
	Feb 28 2018, 1:22 PM

Description

Needs more info on the description from the parent task.

Details

	Subject	Repo	Branch	Lines +/-
	Consider as busy all queries that are not in Sleep state	operations/software	master	+2 -2


Related Objects

Status	Assigned	Task
Resolved	Lucas_Werkmeister_WMDE	T173695 Enable constraint checks by default for users
Open	None	T103228 Improve performance of constraint check
Resolved	Lydia_Pintscher	T179839 Cache constraint check results
Resolved	Lydia_Pintscher	T179849 Cache all constraint check results per-entity
Resolved	Lucas_Werkmeister_WMDE	T181060 Cache constraint check results per-entity in ObjectCache (L) (days: 2)
Resolved	Lucas_Werkmeister_WMDE	T184812 Enable constraint result caching on Wikidata
Resolved	jcrespo	T188505 Investigate why query killer didn't kill 1-hour long queries

Event Timeline

jcrespo triaged this task as High priority. Feb 28 2018, 1:22 PM

jcrespo created this task.

Restricted Application added a subscriber: Aklapper. Feb 28 2018, 1:22 PM

jcrespo moved this task from Triage to Pending comment on the DBA board. Feb 28 2018, 1:22 PM

jcrespo mentioned this in T184812: Enable constraint result caching on Wikidata. Feb 28 2018, 2:03 PM

Lucas_Werkmeister_WMDE subscribed. Feb 28 2018, 4:04 PM

Jonas subscribed. Feb 28 2018, 4:39 PM

Agabi10 subscribed. Feb 28 2018, 4:46 PM

jcrespo renamed this task from Investigate why query killer didn't kill 1-hour log queries to Investigate why query killer didn't kill 1-hour long queries. Mar 1 2018, 10:24 AM

greg moved this task from Active investigation to Follow-up prevention on the Wikimedia-Incident board. Mar 1 2018, 11:37 PM

Is there any progress so far?
Is someone actively working on this?

In T188505#4016738, @Jonas wrote:

Is there any progress so far?
Is someone actively working on this?

No, we are not working on this yet. We have a lot of fires going on at the moment, and this is one of them; we will try to get to it as soon as we can.

There is a strange gap in all killing activity between November and March:

| 171966558 | 2017-11-20 06:29:35 | wmf_slave_wikiuser_sleep      | kill 1761131953
| 171978924 | 2018-03-01 16:41:05 | wmf_slave_wikiuser_sleep      | kill 1633679517
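
To make the size of that gap concrete, here is a minimal sketch that parses kill-log rows in the pipe-separated layout pasted above and reports the longest interval with no kills. The helper names (`parse_kill_log`, `largest_gap`) are hypothetical and only illustrate the check; they are not part of the query killer itself.

```python
from datetime import datetime

# The two rows quoted in this comment, copied as pasted.
LOG_LINES = [
    "| 171966558 | 2017-11-20 06:29:35 | wmf_slave_wikiuser_sleep      | kill 1761131953",
    "| 171978924 | 2018-03-01 16:41:05 | wmf_slave_wikiuser_sleep      | kill 1633679517",
]


def parse_kill_log(lines):
    """Return (timestamp, event_name, statement) tuples from pipe-separated rows."""
    entries = []
    for line in lines:
        # Drop the outer pipes, then split the remaining columns.
        fields = [field.strip() for field in line.strip().strip("|").split("|")]
        if len(fields) < 4:
            continue  # not a data row
        timestamp = datetime.strptime(fields[1], "%Y-%m-%d %H:%M:%S")
        entries.append((timestamp, fields[2], fields[3]))
    return entries


def largest_gap(entries):
    """Return (gap, earlier, later) for the longest interval between kills, or None."""
    times = sorted(entry[0] for entry in entries)
    gaps = [(later - earlier, earlier, later) for earlier, later in zip(times, times[1:])]
    return max(gaps, key=lambda gap: gap[0]) if gaps else None


gap, earlier, later = largest_gap(parse_kill_log(LOG_LINES))
print(f"No kills logged for {gap.days} days, between {earlier} and {later}")
```

For the two rows above this reports a gap of 101 days.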

Even if that was the cause, the new query killer did not deal with the long-running queries anyway; those had to be killed independently.

Change 415888 had a related patch set uploaded (by Jcrespo; owner: Jcrespo):
[operations/software@master] Consider as busy all queries that are not in Sleep state

https://gerrit.wikimedia.org/r/415888
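
The patch itself is not quoted in this task, so the following is only a sketch of one reading of its title: a connection counts as busy whenever its command is anything other than Sleep, and busy connections of the application user that exceed the time limit get killed. The key names follow the ID, USER, COMMAND and TIME columns of MariaDB's information_schema.PROCESSLIST; the 60-second threshold is assumed from the "(>60)" in the wmf_slave_wikiuser_slow log entries, and `kill_statements` is a hypothetical helper, not the real implementation.

```python
# Assumed from the "(>60)" suffix in the wmf_slave_wikiuser_slow log entries.
SLOW_THRESHOLD_SECONDS = 60


def kill_statements(processlist, user="wikiuser", threshold=SLOW_THRESHOLD_SECONDS):
    """Build KILL statements for busy connections that ran longer than the threshold.

    `processlist` is an iterable of dicts keyed like the columns of
    information_schema.PROCESSLIST (ID, USER, COMMAND, TIME).
    """
    statements = []
    for row in processlist:
        if row["USER"] != user:
            continue  # only the application user is targeted
        if row["COMMAND"] == "Sleep":
            continue  # idle connection: not busy under this reading
        if row["TIME"] <= threshold:
            continue  # still within the allowed run time
        statements.append(f"KILL {row['ID']}")
    return statements


# Hypothetical rows for illustration only.
example_rows = [
    {"ID": 1, "USER": "wikiuser", "COMMAND": "Query", "TIME": 100},
    {"ID": 2, "USER": "wikiuser", "COMMAND": "Sleep", "TIME": 500},
    {"ID": 3, "USER": "wikiuser", "COMMAND": "Query", "TIME": 5},
    {"ID": 4, "USER": "root", "COMMAND": "Query", "TIME": 4000},
    {"ID": 5, "USER": "wikiuser", "COMMAND": "Execute", "TIME": 3600},
]
print(kill_statements(example_rows))
```

Testing for "not Sleep" rather than for one specific command means that a long-running statement in any other state (row 5 here) is also caught.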

gerritbot added a project: Patch-For-Review. Mar 2 2018, 4:52 PM

Running SELECT sleep(70); as wikiuser to check it is at least in some way working.

Also shown in the logs:
| 171978924 | 2018-03-02 17:29:03 | wmf_slave_wikiuser_slow (>60) | kill 1731917342; SELECT sleep(100)

The same query was killed:

MariaDB [wikidatawiki]> SELECT /* Wikibase\Lib\Store\Sql\WikiPageEntityMetaDataLookup::selectRevisionInformationMultiple */ rev_id, rev_content_format, rev_timestamp, page_latest, page_is_redirect, old_id, old_text, old_flags, page_title FROM `page` INNER JOIN `revision` ON ((page_latest=rev_id)) INNER JOIN `text` ON ((old_id=rev_text_id));
ERROR 2013 (HY000): Lost connection to MySQL server during query

| 171978924 | 2018-03-02 17:38:03 | wmf_slave_wikiuser_slow (>60) | kill 1732579668; SELECT  rev_id, rev_content_format, rev_tim

The working theory is that, for some reason, the query killer was disabled, had crashed, or was otherwise not working on that specific host.

I will check that the query killer is updated to the latest version and active on all production hosts, and then consider this resolved.

Marostegui awarded a token. Mar 2 2018, 6:01 PM

Change 415888 merged by jenkins-bot:
[operations/software@master] Consider as busy all queries that are not in Sleep state

https://gerrit.wikimedia.org/r/415888

Mentioned in SAL (#wikimedia-operations) [2018-03-06T15:54:50Z] <jynus> deploying new query killer logic to all wikidata (s8) db replicas T188505

@Lucas_Werkmeister_WMDE I am going to resolve this ticket once it has been deployed to all of s8 (wikidata database section). I will deploy on the other sections more slowly. This is not a risk-free deploy, so please be vigilant if there is something weird happening regarding queries failing or similar issues. This will, however, unblock at least the deployments you wanted to do.

@jcrespo okay, thanks a lot for your help! So far everything seems okay to me (no crazy server load spikes that I can see, and no drop in the rate of Wikidata edits).

Krinkle edited projects, added Sustainability (Incident Followup); removed Wikimedia-Incident. Apr 28 2020, 9:50 PM

Maintenance_bot removed a project: Patch-For-Review. Apr 28 2020, 10:14 PM
