Page MenuHomePhabricator

DBTransactionError "transaction round 'LinksUpdate::doUpdate' still running" (via RunSingleJob.php)
Closed, DeclinedPublicPRODUCTION ERROR

Description

Error

Request URL: jobrunner.discovery.wmnet /rpc/RunSingleJob.php
Request ID: XTiRIwpAMEAAAJn4AA8AAAAD

message
[{exception_id}] {exception_url}   Wikimedia\Rdbms\DBTransactionError from line 271 of /srv/mediawiki/php-1.34.0-wmf.15/includes/libs/rdbms/lbfactory/LBFactory.php: MediaWiki::restInPeace: transaction round 'LinksUpdate::doUpdate' still running
trace
#0 /srv/mediawiki/php-1.34.0-wmf.15/includes/MediaWiki.php(922): Wikimedia\Rdbms\LBFactory->commitMasterChanges(string)
#1 /srv/mediawiki/rpc/RunSingleJob.php(93): MediaWiki->restInPeace()
#2 {main}

Impact

TBD

Notes

There's about 20 of these for 1.34.0-wmf.15 in group0 or group1 today. I'm assuming they're not blockers for train. If they are, please tell.

Event Timeline

Krinkle added a project: Performance-Team.
Krinkle subscribed.

It loos like these have only happened since .14, and usually happen after a previous update has been aborted by an exception. It then continues with the next update (which is good), but leaves the transaction in a dirty state (not good).

Not sure if this is due to a change in LinksUpdate or in the higher-level DeferredUpdates code. I'll take a look at recent changes to DeferredUpdates in Aaron's absence.

mmodell changed the subtype of this task from "Task" to "Production Error".Aug 28 2019, 11:06 PM
Krinkle added a project: WMF-JobQueue.

Oops, didn't realise this was recently added intentionally. I removed it because it looks like a core issue rather than an EventBus/Kafka/ChangeProp issue. But no problem either way.

mobrovac subscribed.

Oops, didn't realise this was recently added intentionally. I removed it because it looks like a core issue rather than an EventBus/Kafka/ChangeProp issue. But no problem either way.

I added it as it manifests itself there, but I agree that the problem lies in mw-core rather than the job queue.

~35 of these since 1.35.0-wmf.8 hit group 1 @ 17:09 UTC.

Krinkle renamed this task from restInPeace: transaction round 'LinksUpdate::doUpdate' still running to DBTransactionError "transaction round 'LinksUpdate::doUpdate' still running (via RunSingleJob.php).Mar 6 2020, 11:09 PM
Krinkle renamed this task from DBTransactionError "transaction round 'LinksUpdate::doUpdate' still running (via RunSingleJob.php) to DBTransactionError "transaction round 'LinksUpdate::doUpdate' still running" (via RunSingleJob.php).
Krinkle moved this task from Untriaged to Core on the WMF-JobQueue board.
Krinkle edited projects, added MediaWiki-Core-JobQueue; removed MediaWiki-General.
Krinkle added a subscriber: aaron.
Aklapper added a subscriber: AMooney.

@AMooney: Assuming that "Set projects" was accidentally used instead of "Add projects", hence restoring some previous project tags.

RunSingleJob.php is part of the operation code using code from EventBus extension (JobExecutor class), I am not sure if this is a problem of mediawiki/core

@Umherirrender The entry point wraps the core JobRunner logic. It's not clear at this point which code is responsible for leaving the transaction open.

The only place where 'LinksUpdate::doUpdate' is in use is:

		// Run post-commit hook handlers without DBO_TRX
		DeferredUpdates::addUpdate( new AutoCommitUpdate(
			$this->getDB(),
			__METHOD__,
			function () {
				$this->getHookRunner()->onLinksUpdateComplete( $this, $this->ticket );
			}
		) );

So this looks like a hook handler is open its own transaction.
There is no way to run that for each hook and include the hook name after __METHOD__, just logging could be done here, but HookContainer is hot code without logging.

https://codesearch.wmcloud.org/deployed/?q=LinksUpdateComplete&i=nope&files=&repos=

There are some hook handlers, some are adding defered updates as well, that cannot make problems (from my understanding).

  • CirrusSearch - Added Jobs to queue
  • EventBus - added DeferredUpdates
  • GeoData - Hooks::doLinksUpdate -> There are LBFactory::commitAndWaitForReplication calls after inserts/deletions with the provided ticket
  • GlobalUsage - Calls GlobalUsage::insertLinks/::deleteLinksFromPage -> There are LBFactory::commitAndWaitForReplication calls with the provided ticket
  • GlobalUserPage - Calls CacheInvalidator::invalidate -> added jobs to queue
  • PageAssessments - Calls PageAssessmentsDAO::doUpdates -> There are LBFactory::commitAndWaitForReplication calls with the provided ticket
  • PageTriage - added DeferredUpdates
  • UploadWizard - Calls UploadWizardCampaign::invalidateCache -> Calls ::touchCheckKey on MainWANObjectCache
  • Wikibase - Calls UsageUpdater::replaceUsagesForPage -> Complex to follow all code path, seen startAtomic/endAtomic/rollback

It seems there is no way to found the extension for this to make problem. I have no further idea how to look deeper in this production error. Maybe not understanding how transactions are still open in LBFactory.

I think the issue might be different from that.

  • If a hook handler tried to start a new transaction, it would fail right there and report that error from the line of code that started the second (bad) one.
  • If a hook handler started but not ended an atomic section, it would fail at the end of the outer "AutoCommitUpdate" for the fname not matching, and thus be reportd from there.
  • If a hook handler tried to close a transaction without opening one, it would fail right there for the fname not matching.

What we are seeing in this report is from MediaWiki::restInPeace, after the AutoCommitUpdate and hooks have finished running. The report says that this transaction remained open, which means it was not somehow not closed by AutoCommitUpdate, or somehow re-opened again between then and the end of the process.

From the last 90 days on the mediawiki-errors dashboard, I find no matches for message:"transaction round" that relate "LinksUpdate".

I did find two relating to ActivityUpdateJob, for which I've filed a new task.