
Failed to acquire page lock in LinksUpdate
Closed, Resolved · Public

Description

After switching refreshLinks to the Kafka queue, there's been one instance of an exception that was not observed before (for all of them see https://logstash.wikimedia.org/goto/bb6bf23b741ea699c0d874950005f799)

Exception executing job: refreshLinks Template:API causeAction=edit-page causeAgent=Tgr (WMF) pages={"266691":[0,"Wikibase/API/ja"]} requestId=Wo8QIQpAMD0AAGlEvgUAAABA rootJobSignature=e7ab18aa1dac3512ab187653f487ac00050727c2 rootJobTimestamp=20180222184658 triggeredRecursive=1 : RuntimeException: Could not acquire lock 'LinksUpdate:job:pageid:266691'.

with the following stack trace:

#0 /srv/mediawiki/php-1.31.0-wmf.22/includes/jobqueue/jobs/RefreshLinksJob.php(148): LinksUpdate::acquirePageLock(Wikimedia\Rdbms\DatabaseMysqli, integer, string)
#1 /srv/mediawiki/php-1.31.0-wmf.22/includes/jobqueue/jobs/RefreshLinksJob.php(122): RefreshLinksJob->runForTitle(Title)
#2 /srv/mediawiki/php-1.31.0-wmf.22/extensions/EventBus/includes/JobExecutor.php(51): RefreshLinksJob->run()
#3 /srv/mediawiki/rpc/RunSingleJob.php(79): JobExecutor->execute(array)
#4 {main}
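
For context, the lock in question is a MySQL named lock acquired via GET_LOCK() with a timeout (the exact query shows up in the slow-query logs quoted later in this task). A minimal sketch of the idea, not MediaWiki's actual implementation; the helper name and the PDO usage are illustrative:

// Illustrative sketch only: the page lock is a MySQL named lock taken via
// GET_LOCK() with a timeout. If no other connection releases the lock within
// the timeout, GET_LOCK() returns 0 and the job fails with the exception above.
function acquirePageLockSketch( PDO $dbw, int $pageId, int $timeoutSec = 15 ): bool {
    $key = "LinksUpdate:job:pageid:$pageId";
    $stmt = $dbw->prepare( 'SELECT GET_LOCK(?, ?) AS lockstatus' );
    $stmt->execute( [ $key, $timeoutSec ] );
    return (int)$stmt->fetchColumn() === 1;
}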

Simultaneously with the spike in these errors, there was a huge spike in refreshLinks job processing latency: the mean (and p99) went up to 2 minutes. Immediately afterwards, there was also a bunch of Retry Count Exceeded logs from change-prop.

The rate of refreshLinks jobs on test wikis is very low, so it's not clear whether this is a one-off issue caused by some other outage or a symptom of a larger problem. This needs to be investigated before proceeding with rolling out the refreshLinks job to other wikis.

Event Timeline

Pchelolo created this task.

This happened again. Here's the chronology of events for a single page, according to the logs and Kafka messages:

  1. 2018-03-01T15:56:05 A refreshLinks event is fired for the page https://www.wikidata.org/wiki/User:Jura1/test4
  2. 2018-03-01T15:56:17 The request to the JobExecutor fails with a 504 Gateway Timeout; a retry event is fired.
  3. 2018-03-01T15:57:32 A log message appears saying Could not acquire lock 'LinksUpdate:job:pageid:21858662', followed by:
SlowTimer [15000ms] at runtime/ext_mysql: slow query: SELECT /* LinksUpdate::acquirePageLock  */ GET_LOCK('LinksUpdate:job:pageid:21858662', 15) AS lockstatus
  4. 2018-03-01T15:57:32 Another retry event is fired and executed after a more significant delay.

No more log messages appear and there's no Retry count exceeded message in the change-prop logs.

So it seems like the original request to execute the job failed with a 504 and a retry event was issued. The delay for the first retry is 1 minute; that attempt fails with the lock acquisition error after 15 seconds of trying (the timestamps match up perfectly). Another retry is then issued with an even higher delay, and that last retry actually succeeds.

Given that for the Job Queue it's not very important to process events as quickly as possible, I propose to try significantly increasing the retry delays. They grow exponentially, but there are only 2 of them, so currently they are 1 minute and 6 minutes; I propose to change the formula to get 5 minutes and 15 minutes.
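
For illustration, a minimal sketch of the retry delay arithmetic (the function name, base, and factor are illustrative; the real values live in the change-prop configuration):

function retryDelaySeconds( int $attempt, int $baseSec, int $factor ): int {
    // Exponential backoff: attempt 1 gets the base delay, attempt 2 gets base * factor, etc.
    return $baseSec * ( $factor ** ( $attempt - 1 ) );
}

// Current behaviour:  retryDelaySeconds( 1, 60, 6 )  === 60  (1 min),  retryDelaySeconds( 2, 60, 6 )  === 360 (6 min)
// Proposed behaviour: retryDelaySeconds( 1, 300, 3 ) === 300 (5 min),  retryDelaySeconds( 2, 300, 3 ) === 900 (15 min)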

Tracked another instance and it's a bit different:

  1. 2018-03-01T15:23:37 - original event fired
  2. 2018-03-01T15:23:38 - SlowTimer [13893ms] at runtime/ext_mysql: slow query: SELECT /* LinksUpdate::acquirePageLock */
  3. 2018-03-01T15:23:46 - SlowTimer [15000ms] at runtime/ext_mysql: slow query: SELECT /* LinksUpdate::acquirePageLock */
  4. 2018-03-01T15:23:46 - /rpc/RunSingleJob.php RuntimeException from line 205 of /srv/mediawiki/php-1.31.0-wmf.22/includes/deferred/LinksUpdate.php: Could not acquire lock 'LinksUpdate:job:pageid:8163150'.
  5. 2018-03-01T15:23:52
    • retry event fired with error_status 500
    • Could not acquire lock 'LinksUpdate:job:pageid:8163150'.
    • SlowTimer [15000ms] at runtime/ext_mysql: slow query: SELECT /* LinksUpdate::acquirePageLock */

Several SlowTimer logs were fired because other jobs were trying to update the same page at the same time.

After exploring other cases the reason seems clear: the refreshLinks job releases the page lock only after waiting for replica lag, which creates lock contention if many jobs try to update the same page quickly or if the replication lag is high at the moment. The old queue waited for replication lag after every single job execution, while the new queue does not. This means that the proposed solution of transferring the replication lag waiting code will probably help by decreasing the job execution rate when replication lag is high.
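
A minimal sketch of that idea, assuming a MediaWiki-style load balancer factory (the wrapper function and the exact call site in the JobExecutor are illustrative):

use MediaWiki\MediaWikiServices;

// Illustrative sketch only: after running each job, block until the replica DBs
// have caught up with the master, so jobs that update the same page hold the
// page lock (and pile up behind it) far less often when replication lag is high.
function executeJobAndWaitForReplication( Job $job ): bool {
    $ok = $job->run();
    MediaWikiServices::getInstance()->getDBLoadBalancerFactory()->waitForReplication();
    return $ok;
}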

Since the deployment of the fix, this has not happened again even though the rate of refreshLinks jobs is fairly high right now. I'll move it to done but won't close just yet, in order to give us more time to verify the issue is gone.

After merging the patch to wait for replicas to catch up, we've seen only 9 occurrences of this issue in the last 24 hours, and since the retry of each failed job succeeded, I think we can state that it's good enough (TM) and resolve the ticket.