SELECT /* CategoryMembershipChangeJob::run 127.0.0.1 */ GET_LOCK('CategoryMembershipUpdates:XXXX', 10) AS lockstatus
Closed, Resolved · Public · PRODUCTION ERROR

Description

fatalmonitor shows a lot of messages such as:

[10000ms] at runtime/ext_mysql: slow query: SELECT /* CategoryMembershipChangeJob::run 127.0.0.1 */ GET_LOCK('CategoryMembershipUpdates:#######', 10) AS lockstatus

No clue what they are though :(

Details

	Subject	Repo	Branch	Lines +/-
	Reduce CategoryMembershipChangeJob lock timeout	mediawiki/core	master	+1 -1

Related Objects

Mentioned In: T147748: Large number of CategoryMembershipChangeJob::run updates are failing
Mentioned Here: T109943: Long running query from LinksUpdate::incrTableUpdate job causing general lag
T95501: Fix causes of replica lag and get it to under 5 seconds at peak

Event Timeline

hashar created this task. Apr 27 2016, 7:13 PM

Restricted Application added a subscriber: Aklapper. Apr 27 2016, 7:13 PM

a lot

How many, link?

this is 600 of the last 1000 hhvm.log entries (~42/min)

While the bug may be valid, this will happen every time there is lag on a slave, so it is a consequence, not a cause. Without a real cause for the lag, this is just T95501, unless MediaWiki's locking model is changed. See, for example, T109943.

jcrespo moved this task from Triage to Blocked external/Not db team on the DBA board. May 10 2016, 10:59 AM

Addshore added a project: CatWatch. Sep 13 2016, 4:42 PM

Restricted Application added a project: TCB-Team (now WMDE-TechWish). Sep 13 2016, 4:42 PM

Would it make sense to not log these as slow queries?
But perhaps log them as something else?

Our Puppet manifest for HHVM has:

slow_query_threshold => to_milliseconds('10s')

And MediaWiki's CategoryMembershipChangeJob::run has a lock timeout hardcoded to 10 seconds, so a lock wait that runs to the timeout triggers the SlowTimer notification.

So raising one or the other would hide the message.
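To make the arithmetic explicit, here is a minimal Python sketch of why the two settings collide. The function names are hypothetical, and the "at or above the threshold" comparison is an assumption inferred from the [10000ms] log entries, not taken from the HHVM source.

```python
# Threshold from the Puppet manifest: to_milliseconds('10s')
SLOW_QUERY_THRESHOLD_MS = 10_000


def worst_case_lock_wait_ms(lock_timeout_s: float) -> int:
    """Longest time GET_LOCK(name, timeout) can block before giving up."""
    return int(lock_timeout_s * 1000)


def trips_slow_timer(elapsed_ms: int,
                     threshold_ms: int = SLOW_QUERY_THRESHOLD_MS) -> bool:
    # Assumption: a query is reported once its elapsed time reaches the threshold.
    return elapsed_ms >= threshold_ms


# The hardcoded 10-second lock timeout lands exactly on the threshold.
print(trips_slow_timer(worst_case_lock_wait_ms(10)))
```

Under that assumption, any lock timeout strictly below the threshold (or any threshold strictly above the lock timeout) keeps a timed-out lock wait out of the slow query log.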

@jcrespo For the list of messages, I looked in https://logstash.wikimedia.org/ and searched for "GET_LOCK('CategoryMembershipUpdates". Today we had a few hour-long events with 500-800 such messages per minute.

@hashar Yesterday we had a failover after a slave crashed. Given the limitation that jobs continue hitting the same server (remember, we discussed this limitation on an unrelated ticket), I wouldn't be surprised if jobs were having issues.

Change 310514 had a related patch set uploaded (by Aaron Schulz):
Reduce CategoryMembershipChangeJob lock timeout

https://gerrit.wikimedia.org/r/310514

gerritbot added a project: Patch-For-Review. Sep 14 2016, 10:28 AM

Change 310514 merged by jenkins-bot:
Reduce CategoryMembershipChangeJob lock timeout

https://gerrit.wikimedia.org/r/310514

https://gerrit.wikimedia.org/r/310514 changes the GET_LOCK() timeout from 10 to 3 seconds. No idea about the impact on the db/job, but that will surely stop the SlowTimer notification.
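As a rough illustration of what the shorter timeout changes for a blocked job, the sketch below imitates named locks in memory with Python's threading module. It is not MediaWiki or MySQL code: NamedLocks and timed_get_lock are hypothetical helpers, every call is treated as coming from a separate session, the lock key is a placeholder, and the timeouts are scaled down so the demo finishes quickly. The return values follow MySQL's GET_LOCK() convention of 1 for acquired and 0 for timed out.

```python
import threading
import time


class NamedLocks:
    """In-memory stand-in for MySQL named locks (hypothetical helper)."""

    def __init__(self):
        self._guard = threading.Lock()
        self._locks = {}

    def get_lock(self, name: str, timeout_s: float) -> int:
        # Look up (or create) the lock for this name, then wait up to timeout_s.
        with self._guard:
            lock = self._locks.setdefault(name, threading.Lock())
        return 1 if lock.acquire(timeout=timeout_s) else 0

    def release_lock(self, name: str) -> int:
        with self._guard:
            lock = self._locks.get(name)
        if lock is None or not lock.locked():
            return 0
        lock.release()
        return 1


def timed_get_lock(locks: NamedLocks, name: str, timeout_s: float):
    """Return (status, seconds spent waiting) for one lock attempt."""
    start = time.monotonic()
    status = locks.get_lock(name, timeout_s)
    return status, time.monotonic() - start


SCALE = 0.01  # 1 "second" in the ticket is 10 ms here
locks = NamedLocks()
name = "CategoryMembershipUpdates:demo"  # placeholder key, not a real page ID

holder = locks.get_lock(name, 10 * SCALE)  # a first job takes the lock
status_old, waited_old = timed_get_lock(locks, name, 10 * SCALE)  # old timeout
status_new, waited_new = timed_get_lock(locks, name, 3 * SCALE)   # new timeout
locks.release_lock(name)
status_free, _ = timed_get_lock(locks, name, 3 * SCALE)  # lock is free again
```

Both blocked attempts fail with status 0, but the one using the shorter timeout gives up sooner. In this model the patch does not make contended jobs succeed; it only shortens how long each one waits before failing.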

ReleaseTaggerBot added projects: MW-1.28-release (WMF-deploy-2016-09-20_(1.28.0-wmf.20)), MW-1.28-release-notes. Sep 14 2016, 11:00 AM

Addshore moved this task from Incoming to Other on the TCB-Team (now WMDE-TechWish) board. Sep 19 2016, 8:34 AM

hashar moved this task from Untriaged to Since Dec 2018 / 1.33.wmf9 on the Wikimedia-production-error board. Sep 26 2016, 3:28 PM

hashar moved this task from Since Dec 2018 / 1.33.wmf9 to Mar 2019 / 1.33wmf20-23 on the Wikimedia-production-error board.

Fixed by https://gerrit.wikimedia.org/r/#/c/310514/, which reduces the lock timeout to 3 seconds. Deployed with MW-1.28-release (WMF-deploy-2016-09-20_(1.28.0-wmf.20)).

jcrespo mentioned this in T147748: Large number of CategoryMembershipChangeJob::run updates are failing. Oct 9 2016, 10:27 AM

MZMcBride subscribed. Oct 10 2016, 5:28 AM

Krinkle moved this task from Mar 2019 / 1.33wmf20-23 to Resolved on the Wikimedia-production-error board. Mar 19 2019, 3:33 PM

mmodell changed the subtype of this task from "Task" to "Production Error". Aug 28 2019, 11:11 PM
