Could not acquire lock 'LinksUpdate:job:pageid:xxx'
Closed, Resolved · Public · PRODUCTION ERROR

Description

Seems to have started around the time 1.30.0-wmf.9 rolled out to group1

#0 /srv/mediawiki/php-1.30.0-wmf.9/includes/jobqueue/jobs/RefreshLinksJob.php(144): LinksUpdate::acquirePageLock(Wikimedia\Rdbms\DatabaseMysqli, integer, string)
#1 /srv/mediawiki/php-1.30.0-wmf.9/includes/jobqueue/jobs/RefreshLinksJob.php(118): RefreshLinksJob->runForTitle(Title)
#2 /srv/mediawiki/php-1.30.0-wmf.9/includes/jobqueue/JobRunner.php(293): RefreshLinksJob->run()
#3 /srv/mediawiki/php-1.30.0-wmf.9/includes/jobqueue/JobRunner.php(193): JobRunner->executeJob(RefreshLinksJob, Wikimedia\Rdbms\LBFactoryMulti, BufferingStatsdDataFactory, integer)
#4 /srv/mediawiki/rpc/RunJobs.php(47): JobRunner->run(array)
#5 {main}

Details

	Subject	Repo	Branch	Lines +/-
	jobqueue: Use explicit retry when refreshLinks can't get a lock	mediawiki/core	wmf/1.32.0-wmf.19	+23 -3
	jobqueue: Use explicit retry when refreshLinks can't get a lock	mediawiki/core	master	+23 -3


Related Objects

Mentioned In: T206288: Exception from LinksUpdate "Could not acquire lock for page" when a page is edited frequently

Event Timeline

thcipriani created this task. · Jul 13 2017, 4:24 PM

Restricted Application added a subscriber: Aklapper. · Jul 13 2017, 4:24 PM

thcipriani added a parent task: T167893: MW-1.30.0-wmf.9 deployment blockers. · Jul 13 2017, 4:37 PM

UBN since I added as a train blocker.

Added as blocker as this appears to be a new log message with wmf.9.

Restricted Application added subscribers: Jay8g, TerraCodes. · Jul 13 2017, 4:40 PM

Adding @aaron per https://www.mediawiki.org/wiki/Developers/Maintainers

@aaron: new spammy log message in production (wmf.9) as of group1 (non-wikipedias).

greg added a project: MediaWiki-Core-JobQueue. · Jul 13 2017, 8:54 PM

I don't see much noise from the logs about refreshLinks at https://logstash.wikimedia.org/goto/a029053d21a195163e68acc0a23e760e.

This has been known to happen in worse waves before. For example, Commonist keeps a gallery page of ALL files a user uploaded, so when they post 50 new ones it triggers refresh jobs 50 times for a page with 10Ks of files (failures and job recycling lead to more retries).

There's more at https://logstash.wikimedia.org/goto/7d7e551744c10bcf89e03ec78a839076 for categoryLinksUpdate, but still not that much, and mostly centered around a handful of select pages.

I suppose those can happen on pages with many revisions but few or no recent recentchange rows for some reason, causing lots of scanning.

In T170596#3437768, @aaron wrote:

I don't see much noise from the logs about refreshLinks at https://logstash.wikimedia.org/goto/a029053d21a195163e68acc0a23e760e.

I flagged this as a new error message because it has only been logged in the exceptions channel since wmf.9: https://logstash.wikimedia.org/goto/b95012d87a9e8836b306c2cf7099d386

Removing as train blocker: old log message in a new channel.

thcipriani removed a parent task: T167893: MW-1.30.0-wmf.9 deployment blockers. · Jul 13 2017, 9:37 PM

Still seen in production Logstash: about 400 jobs in the past 7 days failed with a "Could not acquire lock" exception from LinksUpdate.

@aaron If this is normal and/or eventually consistent, could the job catch this exception in some way so as to not cause a fatal?

Krinkle moved this task from Untriaged to Dec2019/1.35.wmf.10+ on the Wikimedia-production-error board. · Aug 14 2018, 2:55 AM

@Krinkle too (noting here so collab doesn't get dropped)

Imarlier moved this task from Inbox, needs triage to To-do: Goals prioritized current Quarter on the Performance-Team board. · Aug 27 2018, 8:13 PM

Aside from using a narrower exception type and catching it, it's probably even easier to make acquirePageLock() return a boolean and log the error to a channel (possibly at INFO level). The page_id should be extra Logstash metadata, to make grouping easier. I suspect certain pages (like Commonist gallery subpages or such) are more likely to be offenders than others.

https://en.wikipedia.org/wiki/User:Sam_Sailor/CSD_log seems to be an offending page (many links, possible parallel updates).

Change 456023 had a related patch set uploaded (by Krinkle; owner: Krinkle):
[mediawiki/core@master] jobqueue: Use explicit retry when refreshLinks can't get a lock

https://gerrit.wikimedia.org/r/456023
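
For readers unfamiliar with the pattern named in the patch subject, here is a rough sketch of what an explicit retry can look like in a job runner. It is illustrative Python, not the content of change 456023: run_job, run_queue, MAX_ATTEMPTS, and the set of locked pages are all hypothetical, and the real patch may differ in every detail.

```python
from collections import deque

MAX_ATTEMPTS = 3  # hypothetical cap on attempts per job


def run_job(page_id, locked_pages):
    """Return 'retry' when the page lock is taken, rather than raising."""
    if page_id in locked_pages:
        return "retry"
    return "ok"


def run_queue(page_ids, locked_pages):
    """Run each job; re-enqueue jobs that ask for a retry, up to MAX_ATTEMPTS."""
    queue = deque((page_id, 1) for page_id in page_ids)
    done, abandoned = [], []
    while queue:
        page_id, attempt = queue.popleft()
        if run_job(page_id, locked_pages) == "ok":
            done.append(page_id)
        elif attempt < MAX_ATTEMPTS:
            # Explicit retry: put the job back instead of failing with an exception.
            queue.append((page_id, attempt + 1))
        else:
            abandoned.append(page_id)
    return done, abandoned
```

The point of the pattern is that lock contention becomes an ordinary, countable outcome of the job instead of an uncaught exception in the runner.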

gerritbot added a project: Patch-For-Review. · Aug 28 2018, 9:17 PM

Change 456023 merged by jenkins-bot:
[mediawiki/core@master] jobqueue: Use explicit retry when refreshLinks can't get a lock

https://gerrit.wikimedia.org/r/456023

ReleaseTaggerBot added a project: MW-1.32-notes (WMF-deploy-2018-09-04 (1.32.0-wmf.20)). · Aug 29 2018, 10:00 PM

Change 456688 had a related patch set uploaded (by Krinkle; owner: Krinkle):
[mediawiki/core@wmf/1.32.0-wmf.19] jobqueue: Use explicit retry when refreshLinks can't get a lock

https://gerrit.wikimedia.org/r/456688

Change 456688 merged by jenkins-bot:
[mediawiki/core@wmf/1.32.0-wmf.19] jobqueue: Use explicit retry when refreshLinks can't get a lock

https://gerrit.wikimedia.org/r/456688

ReleaseTaggerBot edited projects, added MW-1.32-notes (WMF-deploy-2018-08-28 (1.32.0-wmf.19)); removed MW-1.32-notes (WMF-deploy-2018-09-04 (1.32.0-wmf.20)). · Aug 31 2018, 10:00 PM

Logstash results for message:"Could not acquire lock" AND channel:"JobExecutor" stopped after the above patch was deployed.

Krinkle removed a project: Patch-For-Review. · Sep 1 2018, 1:23 AM

Krinkle mentioned this in T206288: Exception from LinksUpdate "Could not acquire lock for page" when a page is edited frequently. · Oct 5 2018, 4:31 AM

Krinkle moved this task from Dec2019/1.35.wmf.10+ to Resolved on the Wikimedia-production-error board. · May 29 2019, 4:00 PM

mmodell changed the subtype of this task from "Task" to "Production Error". · Aug 28 2019, 11:10 PM
