MessageGroupStats co-operative deadlock with transactions and GET_LOCK
Closed, Resolved, Public

Description

This is related to bug 51410, but looks like a new form of the old problem introduced by that fix.

On the mediawikiwiki master, queries hit lock wait timeouts in what looks like an effective deadlock between transactions and co-operative locks.

SELECT /* MessageGroupStats::forItemInternal */ GET_LOCK('MessageGroupStats:modify:page-MediaWiki-Vagrant', 1) AS lockstatus;

UPDATE /* LinksUpdate::updateLinksTimestamp */ page SET page_links_updated = '20140829051059' WHERE page_id = '226112';

The queries are unrelated. The LinksUpdate query is perfectly ok until MessageGroupStats appears.

From the database end, it looks like MessageGroupStats calls GET_LOCK() in a loop on a connection that may already have an open transaction holding row locks on the page and translate_groupstats tables.

When the co-op lock is not acquired quickly, MessageGroupStats transactions bottleneck and queue up, collectively holding many row locks and blocking other queries like LinksUpdate *and whichever MessageGroupStats connection already holds the co-op lock*.
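To make the interleaving concrete, here is a rough sketch of how three connections could end up stuck. The session labels, the statement order, and the SELECT ... FOR UPDATE stand-ins are assumptions for illustration only; the real transactions take their row locks through whatever writes they happen to perform.

```sql
-- Session A (MessageGroupStats): takes the co-op lock first.
SELECT GET_LOCK('MessageGroupStats:modify:page-MediaWiki-Vagrant', 1) AS lockstatus;
-- lockstatus = 1, A now holds the named lock

-- Session B (MessageGroupStats): already inside a transaction with row locks.
BEGIN;
-- Stand-in (assumption) for any earlier write that locked this page row.
SELECT page_id FROM page WHERE page_id = 226112 FOR UPDATE;
-- Retried in a loop by the caller; each attempt gives up after 1 second
-- with lockstatus = 0 while the row lock above stays held.
SELECT GET_LOCK('MessageGroupStats:modify:page-MediaWiki-Vagrant', 1) AS lockstatus;

-- Session C (LinksUpdate): unrelated, but needs the row B has locked.
UPDATE page SET page_links_updated = '20140829051059' WHERE page_id = '226112';
-- blocks until B commits or the InnoDB lock wait timeout fires

-- Session A again: needs the same row while still holding the co-op lock.
-- Stand-in (assumption) for A's own write to that row.
SELECT page_id FROM page WHERE page_id = 226112 FOR UPDATE;
-- blocks behind B, while B keeps waiting on A's named lock
```

In this sketch A waits on B's row lock and B waits on A's named lock. As far as I understand, InnoDB does not see the named lock as part of its row-lock wait graph, so nothing is reported as a deadlock and the connections simply run into timeouts.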

We should not be combining transactions and co-operative locking in this manner.


Version: master
Severity: major

Details

Reference
bz70153

Event Timeline

bzimport raised the priority of this task to High. Nov 22 2014, 3:35 AM
bzimport set Reference to bz70153.
bzimport added a subscriber: Unknown Object (MLST).

Some of this code is run in autocommit mode by the job runners, but at other times it still happens inside a big transaction. I've mentioned that all of this needs to be moved to the job queue.

A quick workaround would be to not use GET_LOCK for web requests, or to use some MW transaction hook to defer all of this until after COMMIT for web requests.
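At the SQL level, the post-COMMIT ordering would look roughly like the sketch below. This only illustrates the idea; it is not necessarily what any particular patch does, and the placeholder comment for the stats update is an assumption.

```sql
-- Finish the request's transactional work first so no row locks are held.
COMMIT;

-- Optional sanity check: 1 means the connection is in autocommit mode.
SELECT @@autocommit;

-- Only now try for the co-op lock; a waiter holds nothing others need.
SELECT GET_LOCK('MessageGroupStats:modify:page-MediaWiki-Vagrant', 1) AS lockstatus;

-- If lockstatus = 1: run the stats update here as autocommit statements
-- (placeholder, the actual statements are not shown in this report).

-- Release the named lock as soon as the update is done.
SELECT RELEASE_LOCK('MessageGroupStats:modify:page-MediaWiki-Vagrant');
```

With this ordering a connection that fails to get the named lock within the timeout can skip or retry the update without blocking LinksUpdate or the current lock holder.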

gerritadmin wrote:

Change 157040 had a related patch set uploaded by Aaron Schulz:
Avoid GET_LOCK in non-autocommit mode

https://gerrit.wikimedia.org/r/157040

A bit off topic, sorry...

(In reply to Aaron Schulz from comment #1)

I've mentioned that all of this needs to be moved to the job queue.

Using the job queue, however, can sometimes be disruptive when it gets in the way of editing, as I think bug 69669 may show (panic for some minutes while a translation-admin-related action was being completed). It would be nice to have a write-up of what should be moved to the job queue, and also of which actions should be high priority in the job queue.

(In reply to Nemo from comment #3)

If the problem is responsiveness, then it can always go into a small dedicated job loop in jobrunner.conf.erb.

gerritadmin wrote:

Change 157040 merged by jenkins-bot:
Avoid GET_LOCK in non-autocommit mode

https://gerrit.wikimedia.org/r/157040

All patches mentioned in this report were merged or abandoned - is there more work left to do here (if yes: please reset the bug report status to NEW or ASSIGNED), or can you close this ticket as RESOLVED FIXED?