Page MenuHomePhabricator

Infinite loop in DeferredUpdates::tryOpportunisticExecute
Closed, ResolvedPublic

Description

This is split from T252764: Automatic renaming by Fuzzybot broken to focus on the underlying issue. It is likely that our rename jobs trigger a rare code path due to them causing a lot of deferred updates.

The trace is:

  • require_once
  • RunJobs->execute
  • JobRunner->run
  • JobRunner->executeJob
  • JobRunner->doExecuteJob
  • MessageUpdateJob->run
  • MessageUpdateJob->handleRename
  • MovePage->move
  • MovePage->moveUnsafe
  • MovePage->moveToInternal
  • ManualLogEntry->publish
  • DeferredUpdates::addCallableUpdate
  • DeferredUpdates::addUpdate
  • DeferredUpdates::tryOpportunisticExecute [loop starts]
  • DeferredUpdates::enqueueUpdates
  • JobQueueGroup->push [1]
  • JobQueue->push
  • JobQueue->batchPush
  • JobQueueDB->doBatchPush
  • Wikimedia\Rdbms\DBConnRef->onTransactionPreCommitOrIdle
  • Wikimedia\Rdbms\DBConnRef->__call
  • Wikimedia\Rdbms\Database->onTransactionPreCommitOrIdle
  • Wikimedia\Rdbms\Database->endAtomic
  • Wikimedia\Rdbms\Database->commit
  • Wikimedia\Rdbms\Database->runTransactionListenerCallbacks
  • Maintenance::{closure}
  • DeferredUpdates::tryOpportunisticExecute [loop restarts]

The job that is getting pushed in [1] is RefreshSecondaryDataUpdate

The jobs in the queue are as follows:
self::$preSendUpdates:

  • MessageCacheUpdate
  • CdnCacheUpdate

self::$postSendUpdates:

  • MWCallableUpdate
  • MWCallableUpdate
  • MWCallableUpdate
  • MWCallableUpdate
  • RefreshSecondaryDataUpdate

Do note that we are hitting the self::BIG_QUEUE_SIZEcondition in rMW includes/deferred/DeferredUpdates.php:495-500

The issue is, I believe, that enqueueUpdates is not safe for recursive entrance, as it does not remove items from the queue before inserting them to the job queue (if they are queueable), which causes another opportunity to execute deferred updates.

Event Timeline

I think this is the underlying cause for T260812: Moving of translation page fails with OOM exception (database job queue backend) as well. I think WMF environment is not affected since it does not use the database backend for the job queue and hence does not trigger this loop.

Commenting out these two lines locally makes the issue go away. I'm not sure what are the drawbacks of a having a big queue size, so I'm leaving it to you to figure out what is an appropriate fix.

		if ( self::pendingUpdatesCount() >= self::BIG_QUEUE_SIZE ) {
				// If we cannot run the updates with outer transaction context, try to
				// at least enqueue all the updates that support queueing to job queue
				// self::$preSendUpdates = self::enqueueUpdates( self::$preSendUpdates );
				// self::$postSendUpdates = self::enqueueUpdates( self::$postSendUpdates );
			}

Tagging @aaron and @Krinkle who removed the recursion guard in rMWc1ebcb0ee24d: Remove $recursionGuard var from tryOpportunisticExecute(). Could you have a look?

This ticket blocks two high priority tickets for the Language team.

Change 596082 had a related patch set uploaded (by Aaron Schulz; owner: Aaron Schulz):
[mediawiki/core@master] deferred: make DeferredUpdates::doUpdates() recursion more uniform

https://gerrit.wikimedia.org/r/596082

Rileych lowered the priority of this task from Medium to Low.EditedJan 13 2021, 1:46 AM

@Naike

This task is blocking two tasks (T260812 & T252764) for the Language Team. Can PET support us by code reviewing this patch https://gerrit.wikimedia.org/r/c/mediawiki/core/+/596082/?

We would appreciate assistance ASAP; when can we expect a timeline?

Rileych raised the priority of this task from Low to High.Jan 13 2021, 1:46 AM

Change 596082 merged by jenkins-bot:
[mediawiki/core@master] deferred: make DeferredUpdates::doUpdates() recursion more uniform

https://gerrit.wikimedia.org/r/596082

Abijeet and I have been doing some testing on this and it looks good so far. We are still planning to do more testing, after which I am planning to mark this ticket as resolved if nothing comes up. Thanks for the help everyone.

Nikerabbit claimed this task.

There has been a couple of small message key renames, but not a large one. I'm marking this resolved, but will reopen in the unlikely case that this issue wasn't fully resolved.