SELECT /* CategoryMembershipChangeJob::run 127.0.0.1 */ GET_LOCK('CategoryMembershipUpdates:XXXX', 10) AS lockstatus
Closed, Resolved | Public | PRODUCTION ERROR

Description

fatalmonitor shows a lot of messages such as:

[10000ms] at runtime/ext_mysql: slow query: SELECT /* CategoryMembershipChangeJob::run 127.0.0.1 */ GET_LOCK('CategoryMembershipUpdates:#######', 10) AS lockstatus

No clue what they are though :(

Event Timeline

This is 600 of the last 1000 hhvm.log entries (~42/min).

While the bug may be valid, this will happen every time there is lag on a slave, so it is a consequence, not a cause. Without a real cause for the lag, this is just T95501, unless the MediaWiki locking model is changed. See, for example, T109943.

Would it make sense to not log these as slow queries?
But perhaps something else?

Our Puppet manifest for HHVM has:

slow_query_threshold => to_milliseconds('10s')

And MediaWiki's CategoryMembershipChangeJob::run has a lock timeout hardcoded to 10 seconds, so a lock wait that runs to the timeout triggers the SlowTimer notification.

So raising one or the other would hide the message.
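
To make the interaction concrete, here is a minimal sketch in Python. It is a toy model, not HHVM or MediaWiki code: the function names are hypothetical, and the "elapsed time at or above the threshold gets logged" rule is an assumption inferred from the `[10000ms]` entries above.

```python
# Toy model of the slow-query threshold vs. the GET_LOCK() timeout.
# All names here are hypothetical; this is not HHVM or MediaWiki code.

# From the Puppet manifest: slow_query_threshold => to_milliseconds('10s')
SLOW_QUERY_THRESHOLD_MS = 10 * 1000


def worst_case_get_lock_ms(lock_timeout_s: int) -> int:
    """GET_LOCK(name, timeout) waits at most `timeout` seconds when the
    lock is held elsewhere, so that is the worst-case query duration."""
    return lock_timeout_s * 1000


def is_logged_as_slow(elapsed_ms: int,
                      threshold_ms: int = SLOW_QUERY_THRESHOLD_MS) -> bool:
    """Assumed rule: a query is reported once its elapsed time reaches
    the configured threshold."""
    return elapsed_ms >= threshold_ms


if __name__ == "__main__":
    for timeout_s in (10, 3):
        elapsed = worst_case_get_lock_ms(timeout_s)
        print(timeout_s, elapsed, is_logged_as_slow(elapsed))
```

Under these assumptions, a contended lock with a 10-second timeout always reaches the 10-second threshold, while any timeout below the threshold cannot trigger the notification through the lock wait alone.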


@jcrespo For the list of messages, I looked in https://logstash.wikimedia.org/ and searched for "GET_LOCK('CategoryMembershipUpdates". Today we had a few hour-long events with 500-800 such messages per minute.

@hashar Yesterday we had a failover after a slave crashed. Given the limitation that jobs continue hitting the same server (remember we discussed this limitation on an unrelated ticket), I wouldn't be surprised if jobs had issues.

Change 310514 had a related patch set uploaded (by Aaron Schulz):
Reduce CategoryMembershipChangeJob lock timeout

https://gerrit.wikimedia.org/r/310514

Change 310514 merged by jenkins-bot:
Reduce CategoryMembershipChangeJob lock timeout

https://gerrit.wikimedia.org/r/310514

https://gerrit.wikimedia.org/r/310514 changes the GET_LOCK() timeout from 10 to 3 seconds. No idea about the impact on the db/job, but that will surely stop the SlowTimer notification.

hashar assigned this task to aaron.

Fixed by https://gerrit.wikimedia.org/r/#/c/310514/ which reduces the lock timeout to 3 seconds. Deployed with MW-1.28-release (WMF-deploy-2016-09-20_(1.28.0-wmf.20)).

mmodell changed the subtype of this task from "Task" to "Production Error". Aug 28 2019, 11:11 PM