Needs more info on the description from the parent task.
Description
Details
Subject | Repo | Branch | Lines +/-
---|---|---|---
Consider as busy all queries that are not in Sleep state | operations/software | master | +2 -2
Status | Subtype | Assigned | Task
---|---|---|---
Resolved | | Lucas_Werkmeister_WMDE | T173695 Enable constraint checks by default for users
Open | | None | T103228 Improve performance of constraint check
Resolved | | Lydia_Pintscher | T179839 Cache constraint check results
Resolved | | Lydia_Pintscher | T179849 Cache all constraint check results per-entity
Resolved | | Lucas_Werkmeister_WMDE | T181060 Cache constraint check results per-entity in ObjectCache (L) (days: 2)
Resolved | | Lucas_Werkmeister_WMDE | T184812 Enable constraint result caching on Wikidata
Resolved | | jcrespo | T188505 Investigate why query killer didn't kill 1-hour long queries
Event Timeline
No, we are not working on this yet. We have lots of fires going on at the moment and this is one of them; we will try to get to it as soon as we can.
There is a strange gap in all kill activity between November 2017 and March 2018:
171966558 | 2017-11-20 06:29:35 | wmf_slave_wikiuser_sleep | kill 1761131953
171978924 | 2018-03-01 16:41:05 | wmf_slave_wikiuser_sleep | kill 1633679517
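For anyone who wants to check other hosts for a similar gap, here is a rough sketch in Python that finds the longest pause between consecutive kill log rows. It assumes rows are pipe-separated with the timestamp in the second field, as in the two rows quoted in this task; `parse_kill_time` and `largest_gap` are hypothetical helper names, not part of the query killer.

```python
from datetime import datetime
from typing import List, Tuple


def parse_kill_time(row: str) -> datetime:
    """Return the timestamp from one pipe-separated kill log row.

    Assumes the second field holds a 'YYYY-MM-DD HH:MM:SS' timestamp.
    """
    fields = [field.strip() for field in row.split("|")]
    return datetime.strptime(fields[1], "%Y-%m-%d %H:%M:%S")


def largest_gap(rows: List[str]) -> Tuple[datetime, datetime]:
    """Return the (start, end) pair of consecutive kills that are furthest apart."""
    times = sorted(parse_kill_time(row) for row in rows)
    if len(times) < 2:
        raise ValueError("need at least two log rows to measure a gap")
    # Compare each timestamp with the next one and keep the widest pair.
    return max(zip(times, times[1:]), key=lambda pair: pair[1] - pair[0])


rows = [
    "171966558 | 2017-11-20 06:29:35 | wmf_slave_wikiuser_sleep | kill 1761131953",
    "171978924 | 2018-03-01 16:41:05 | wmf_slave_wikiuser_sleep | kill 1633679517",
]
start, end = largest_gap(rows)
print(f"longest gap: {start} -> {end} ({end - start})")
```

With only the two rows above as input, the reported gap is simply the interval between them, a little over 101 days.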
Even if that was the cause, the new query killer did not handle the long-running queries anyway; those had to be killed separately.
Change 415888 had a related patch set uploaded (by Jcrespo; owner: Jcrespo):
[operations/software@master] Consider as busy all queries that are not in Sleep state
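For readers following along, the rule in the patch title can be sketched roughly as follows. This is only an illustration in Python, not the code in operations/software: the `Process` record, `is_busy`, and `kill_statements` are hypothetical names, the fields mirror columns of MariaDB's process list, the 60-second threshold is taken from the `(>60)` label in the kill log entries in this task, and restricting the check to `wikiuser` is an assumption.

```python
from dataclasses import dataclass
from typing import List, Optional

# Assumption: threshold taken from the "(>60)" label in the kill log.
SLOW_THRESHOLD_SECONDS = 60


@dataclass
class Process:
    """Hypothetical record mirroring a few processlist columns."""
    id: int
    user: str
    command: str          # e.g. "Query", "Sleep"
    time: int             # seconds spent in the current state
    info: Optional[str]   # statement text, if any


def is_busy(process: Process) -> bool:
    """Treat every connection that is not in Sleep state as busy."""
    return process.command != "Sleep"


def kill_statements(
    processes: List[Process],
    user: str = "wikiuser",
    threshold: int = SLOW_THRESHOLD_SECONDS,
) -> List[str]:
    """Build KILL statements for busy connections of `user` running longer than `threshold`."""
    return [
        f"KILL {p.id}"
        for p in processes
        if p.user == user and is_busy(p) and p.time > threshold
    ]


processes = [
    Process(1, "wikiuser", "Query", 70, "SELECT sleep(70)"),
    Process(2, "wikiuser", "Sleep", 500, None),
    Process(3, "wikiuser", "Query", 10, "SELECT 1"),
    Process(4, "root", "Query", 1000, "SELECT sleep(1000)"),
]
print(kill_statements(processes))
```

In this sketch only the first connection is selected: the sleeping one is not busy, the third is under the threshold, and the fourth belongs to a different user.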
I ran SELECT sleep(70); as wikiuser to check that it is working at least to some degree.
The kill also shows up in the logs:
171978924 | 2018-03-02 17:29:03 | wmf_slave_wikiuser_slow (>60) | kill 1731917342; SELECT sleep(100)
The same query was killed:
MariaDB [wikidatawiki]> SELECT /* Wikibase\Lib\Store\Sql\WikiPageEntityMetaDataLookup::selectRevisionInformationMultiple */ rev_id, rev_content_format, rev_timestamp, page_latest, page_is_redirect, old_id, old_text, old_flags, page_title FROM `page` INNER JOIN `revision` ON ((page_latest=rev_id)) INNER JOIN `text` ON ((old_id=rev_text_id));
ERROR 2013 (HY000): Lost connection to MySQL server during query
171978924 | 2018-03-02 17:38:03 | wmf_slave_wikiuser_slow (>60) | kill 1732579668; SELECT rev_id, rev_content_format, rev_tim
My working theory is that, for some reason (it was disabled, it crashed, or something else happened), the query killer was not working on that specific host.
I will check that the query killer is updated to the latest version and active on all production hosts, and then consider this resolved.
Change 415888 merged by jenkins-bot:
[operations/software@master] Consider as busy all queries that are not in Sleep state
Mentioned in SAL (#wikimedia-operations) [2018-03-06T15:54:50Z] <jynus> deploying new query killer logic to all wikidata (s8) db replicas T188505
@Lucas_Werkmeister_WMDE I am going to resolve this ticket once it has been deployed to all of s8 (wikidata database section). I will deploy on the other sections more slowly. This is not a risk-free deploy, so please be vigilant if there is something weird happening regarding queries failing or similar issues. This will, however, unblock at least the deployments you wanted to do.
@jcrespo okay, thanks a lot for your help! So far everything seems okay to me (no crazy server load spikes that I can see, and no drop in the rate of Wikidata edits).