
Could not acquire lock 'LinksUpdate:job:pageid:xxx'
Closed, Resolved · Public · PRODUCTION ERROR

Description

Seems to have started around the time 1.30.0-wmf.9 rolled out to group1

#0 /srv/mediawiki/php-1.30.0-wmf.9/includes/jobqueue/jobs/RefreshLinksJob.php(144): LinksUpdate::acquirePageLock(Wikimedia\Rdbms\DatabaseMysqli, integer, string)
#1 /srv/mediawiki/php-1.30.0-wmf.9/includes/jobqueue/jobs/RefreshLinksJob.php(118): RefreshLinksJob->runForTitle(Title)
#2 /srv/mediawiki/php-1.30.0-wmf.9/includes/jobqueue/JobRunner.php(293): RefreshLinksJob->run()
#3 /srv/mediawiki/php-1.30.0-wmf.9/includes/jobqueue/JobRunner.php(193): JobRunner->executeJob(RefreshLinksJob, Wikimedia\Rdbms\LBFactoryMulti, BufferingStatsdDataFactory, integer)
#4 /srv/mediawiki/rpc/RunJobs.php(47): JobRunner->run(array)
#5 {main}

Event Timeline

thcipriani triaged this task as Unbreak Now! priority. (Jul 13 2017, 4:40 PM)

UBN since I added this as a train blocker.

Added as blocker as this appears to be a new log message with wmf.9.

Adding @aaron per https://www.mediawiki.org/wiki/Developers/Maintainers

@aaron: new spammy log message in production (wmf.9) as of group1 (non-wikipedias).

I don't see much noise from the logs about refreshLinks at https://logstash.wikimedia.org/goto/a029053d21a195163e68acc0a23e760e.

This has been known to happen in worse waves before. For example, Commonist keeps a gallery page of ALL files a user has uploaded, so when they upload 50 new ones it triggers refresh jobs 50 times for a page with tens of thousands of files (failure and job recycling then lead to more retries).

There's more at https://logstash.wikimedia.org/goto/7d7e551744c10bcf89e03ec78a839076 for categoryLinksUpdate, but still not that much, and mostly centered around a handful of select pages.

I suppose those can happen on pages with many revisions but few or no recent recentchange rows for some reason, causing lots of scanning.

I don't see much noise from the logs about refreshLinks at https://logstash.wikimedia.org/goto/a029053d21a195163e68acc0a23e760e.

I flagged this as a new error message because it only started logging in the exceptions channel as of wmf.9: https://logstash.wikimedia.org/goto/b95012d87a9e8836b306c2cf7099d386

thcipriani lowered the priority of this task from Unbreak Now! to Medium. (Jul 13 2017, 9:37 PM)

Removing as train blocker: old log message in a new channel.

Krinkle subscribed.

Still seen in production Logstash: about 400 jobs in the past 7 days failed with a "Could not acquire lock" exception from LinksUpdate.

@aaron If this is normal and/or eventually consistent, could the job catch this exception in some way so as to not cause a fatal?

Imarlier subscribed.

@Krinkle too (noting here so collab doesn't get dropped)

Aside from using a narrower exception type and catching it, it's probably even easier to make acquirePageLock() return a boolean and log the error to a channel (possibly INFO level). The page_id should be extra logstash metadata, to make grouping easier. I suspect certain pages (like Commonist gallery subpages or such) are more likely to be offenders than others.
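To make the suggestion above concrete, here is a minimal Python sketch of a lock helper that returns a boolean instead of throwing, and logs the failure at INFO level with the page ID as structured metadata. It is not MediaWiki code: the `PageLocks` class, its method names, the in-memory lock set, and the `LinksUpdate` logger name are all assumptions made for illustration. Only the lock key format is taken from the error message in this task.

```python
import logging
import threading

# Hypothetical channel name; the real logging channel is not specified here.
logger = logging.getLogger("LinksUpdate")


class PageLocks:
    """In-memory stand-in for named database locks (illustrative only)."""

    def __init__(self):
        self._held = set()
        self._guard = threading.Lock()

    @staticmethod
    def _key(page_id, why):
        # Same shape as the key in the reported error message.
        return f"LinksUpdate:{why}:pageid:{page_id}"

    def acquire_page_lock(self, page_id, why="job"):
        """Return True if the lock was taken, False if it is already held."""
        key = self._key(page_id, why)
        with self._guard:
            if key in self._held:
                # Log instead of raising; page_id travels as extra metadata
                # so that log entries can be grouped per page.
                logger.info(
                    "Could not acquire lock '%s'", key,
                    extra={"page_id": page_id},
                )
                return False
            self._held.add(key)
            return True

    def release_page_lock(self, page_id, why="job"):
        with self._guard:
            self._held.discard(self._key(page_id, why))
```

With this shape the caller decides what a failed acquisition means (skip, retry later, or give up), and a page that fails repeatedly shows up as a cluster of log entries sharing one `page_id` rather than as a stream of fatals.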

https://en.wikipedia.org/wiki/User:Sam_Sailor/CSD_log seems to be an offending page (many links, possible parallel updates).

Change 456023 had a related patch set uploaded (by Krinkle; owner: Krinkle):
[mediawiki/core@master] jobqueue: Use explicit retry when refreshLinks can't get a lock

https://gerrit.wikimedia.org/r/456023

Change 456023 merged by jenkins-bot:
[mediawiki/core@master] jobqueue: Use explicit retry when refreshLinks can't get a lock

https://gerrit.wikimedia.org/r/456023

Change 456688 had a related patch set uploaded (by Krinkle; owner: Krinkle):
[mediawiki/core@wmf/1.32.0-wmf.19] jobqueue: Use explicit retry when refreshLinks can't get a lock

https://gerrit.wikimedia.org/r/456688

Change 456688 merged by jenkins-bot:
[mediawiki/core@wmf/1.32.0-wmf.19] jobqueue: Use explicit retry when refreshLinks can't get a lock

https://gerrit.wikimedia.org/r/456688

Logstash results for message:"Could not acquire lock" AND channel:"JobExecutor" stopped after the above patch was deployed.
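For readers who have not opened the change, the following Python sketch shows one way an "explicit retry" can work: the job records an error and reports failure instead of raising, and the runner puts it back on the queue. This is a reading of the patch title only, not a port of the patch. The class and function names, the `held_locks` set, and the `max_attempts` limit are assumptions made for illustration.

```python
from collections import deque
from dataclasses import dataclass


@dataclass
class RefreshLinksJob:
    page_id: int
    attempts: int = 0
    last_error: str = ""

    def run(self, held_locks):
        """Return True on success, False if the job should be retried."""
        key = f"LinksUpdate:job:pageid:{self.page_id}"
        if key in held_locks:
            # Record the reason and report failure rather than raising.
            self.last_error = f"Could not acquire lock '{key}'"
            return False
        # The actual link table refresh would happen here.
        return True


def run_jobs(queue, held_locks, max_attempts=3):
    """Run queued jobs, re-queueing failures up to max_attempts (assumed limit)."""
    done, abandoned = [], []
    while queue:
        job = queue.popleft()
        job.attempts += 1
        if job.run(held_locks):
            done.append(job.page_id)
        elif job.attempts < max_attempts:
            queue.append(job)  # explicit retry: back of the queue
        else:
            abandoned.append(job.page_id)
    return done, abandoned
```

In this sketch a lock that is never released causes the job to be abandoned after `max_attempts` tries, while other jobs proceed normally. The contended lock becomes an ordinary, bounded retry rather than an uncaught exception.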

mmodell changed the subtype of this task from "Task" to "Production Error". (Aug 28 2019, 11:10 PM)