Page MenuHomePhabricator

TranslatablePageMoveJob commit while in atomic sections
Closed, ResolvedPublicPRODUCTION ERROR

Description

Spotted in production:

[{exception_id}] {exception_url} Wikimedia\Rdbms\DBUnexpectedError from line 2939 of /srv/mediawiki/php-1.30.0-wmf.18/includes/libs/rdbms/database/Database.php: DatabaseMessageIndex::lock: Got COMMIT while atomic sections MovePage::move are still open.

Urls are all like: /rpc/RunJobs.php?wiki=mediawikiwiki&type=TranslatablePageMoveJob&maxtime=60&maxmem=300M

Stack trace:

#0 /srv/mediawiki/php-1.30.0-wmf.18/extensions/Translate/utils/MessageIndex.php(428): Wikimedia\Rdbms\Database->commit(string, string)
#1 /srv/mediawiki/php-1.30.0-wmf.18/extensions/Translate/utils/MessageIndex.php(150): DatabaseMessageIndex->lock()
#2 /srv/mediawiki/php-1.30.0-wmf.18/extensions/Translate/utils/MessageIndexRebuildJob.php(36): MessageIndex->rebuild()
#3 /srv/mediawiki/php-1.30.0-wmf.18/extensions/Translate/utils/MessageHandle.php(181): MessageIndexRebuildJob->run()
#4 /srv/mediawiki/php-1.30.0-wmf.18/extensions/Translate/tag/PageTranslationHooks.php(944): MessageHandle->isValid()
#5 /srv/mediawiki/php-1.30.0-wmf.18/includes/Hooks.php(186): PageTranslationHooks::onMoveTranslationUnits(Title, Title, User, integer, integer, string, Revision)
#6 /srv/mediawiki/php-1.30.0-wmf.18/includes/MovePage.php(419): Hooks::run(string, array)
#7 [internal function]: Closure$MovePage::move(Wikimedia\Rdbms\DatabaseMysqli, string)
#8 /srv/mediawiki/php-1.30.0-wmf.18/includes/libs/rdbms/database/Database.php(2861): call_user_func_array(Closure$MovePage::move;2398, array)
#9 /srv/mediawiki/php-1.30.0-wmf.18/includes/deferred/AtomicSectionUpdate.php(35): Wikimedia\Rdbms\Database->doAtomicSection(string, Closure$MovePage::move;2398)
#10 /srv/mediawiki/php-1.30.0-wmf.18/includes/deferred/DeferredUpdates.php(257): AtomicSectionUpdate->doUpdate()
#11 /srv/mediawiki/php-1.30.0-wmf.18/includes/deferred/DeferredUpdates.php(210): DeferredUpdates::runUpdate(AtomicSectionUpdate, Wikimedia\Rdbms\LBFactoryMulti, string, integer)
#12 /srv/mediawiki/php-1.30.0-wmf.18/includes/deferred/DeferredUpdates.php(131): DeferredUpdates::execute(array, string, integer)
#13 /srv/mediawiki/php-1.30.0-wmf.18/includes/jobqueue/JobRunner.php(301): DeferredUpdates::doUpdates()
#14 /srv/mediawiki/php-1.30.0-wmf.18/includes/jobqueue/JobRunner.php(193): JobRunner->executeJob(TranslatablePageMoveJob, Wikimedia\Rdbms\LBFactoryMulti, BufferingStatsdDataFactory, integer)
#15 /srv/mediawiki/rpc/RunJobs.php(47): JobRunner->run(array)
#16 {main}

Related Objects

StatusSubtypeAssignedTask
ResolvedRelease demon
ResolvedPRODUCTION ERRORNikerabbit

Event Timeline

This is not a new problem, is it, just spotted now or more strict logging? It's been awhile since this code in Translate was last changed. I could really use some advice how to fix this properly, as I have little time to research this on my own.

I just spotted it, didn't realize it's been a thing before. I'm not the right person. @aaron probably is.

Probably the onMoveTranslationUnits handler should be a closure sent to DeferredUpdates. Anything that needs to COMMIT/BEGIN within it's scope needs full transaction control, which is not usually guaranteed when some hook triggers.

Looks like this is still the case. Last spotted in production on March 29.

type=mediawiki channel=exception level=ERROR

/rpc/RunJobs.php?wiki=metawiki&type=TranslatablePageMoveJob&maxtime=60&maxmem=300M

DatabaseMessageIndex::lock: Got COMMIT while atomic sections MovePage::move are still open.

#0 /srv/mediawiki/php-1.31.0-wmf.26/extensions/Translate/utils/MessageIndex.php(443): Wikimedia\Rdbms\Database->commit(string, string)
#1 /srv/mediawiki/php-1.31.0-wmf.26/extensions/Translate/utils/MessageIndex.php(161): DatabaseMessageIndex->lock()
#2 /srv/mediawiki/php-1.31.0-wmf.26/extensions/Translate/utils/MessageIndexRebuildJob.php(36): MessageIndex->rebuild()
#3 /srv/mediawiki/php-1.31.0-wmf.26/extensions/Translate/utils/MessageHandle.php(181): MessageIndexRebuildJob->run()
#4 /srv/mediawiki/php-1.31.0-wmf.26/extensions/Translate/tag/PageTranslationHooks.php(1060): MessageHandle->isValid()
#5 /srv/mediawiki/php-1.31.0-wmf.26/includes/Hooks.php(177): PageTranslationHooks::onMoveTranslationUnits(Title, Title, User, integer, integer, string, Revision)
#6 /srv/mediawiki/php-1.31.0-wmf.26/includes/Hooks.php(205): Hooks::callHook(string, array, array, NULL)
#7 /srv/mediawiki/php-1.31.0-wmf.26/includes/MovePage.php(419): Hooks::run(string, array)
#8 [internal function]: Closure$MovePage::move(Wikimedia\Rdbms\DatabaseMysqli, string)
#9 /srv/mediawiki/php-1.31.0-wmf.26/includes/libs/rdbms/database/Database.php(3167): call_user_func_array(Closure$MovePage::move;2878, array)
#10 /srv/mediawiki/php-1.31.0-wmf.26/includes/deferred/AtomicSectionUpdate.php(35): Wikimedia\Rdbms\Database->doAtomicSection(string, Closure$MovePage::move;2878)
#11 /srv/mediawiki/php-1.31.0-wmf.26/includes/deferred/DeferredUpdates.php(259): AtomicSectionUpdate->doUpdate()
#12 /srv/mediawiki/php-1.31.0-wmf.26/includes/deferred/DeferredUpdates.php(210): DeferredUpdates::runUpdate(AtomicSectionUpdate, Wikimedia\Rdbms\LBFactoryMulti, string, integer)
#13 /srv/mediawiki/php-1.31.0-wmf.26/includes/deferred/DeferredUpdates.php(131): DeferredUpdates::execute(array, string, integer)
#14 /srv/mediawiki/php-1.31.0-wmf.26/includes/jobqueue/JobRunner.php(300): DeferredUpdates::doUpdates()
#15 /srv/mediawiki/php-1.31.0-wmf.26/includes/jobqueue/JobRunner.php(193): JobRunner->executeJob(TranslatablePageMoveJob, Wikimedia\Rdbms\LBFactoryMulti, BufferingStatsdDataFactory, integer)
#16 /srv/mediawiki/rpc/RunJobs.php(47): JobRunner->run(array)
#17 {main}

It looks like it really just wants to flush its select snapshot, per source code.

@aaron Is there a better way for them to do that, in a way that will work gracefully regardless of how the code is triggered? (deferred, job, nested job).

The message index code could do for a large amount of rework. In the meantime, I can't tell why the MessageIndexRebuildJob::newJob() instance must run immediately in isValid()...it's not like the method recheck's what it did before after the rebuild. If nothing else depends on it being immediate, then it should use a DeferredUpdate. If it has to be immediate...then CONN_TRX_AUTO can be considered (as long as it doesn't deadlock by having to transactions updating the same rows).

175                 // Do another check that the group actually exists
176                 $group = $this->getGroup();
177                 if ( !$group ) {
178                         $warning = "MessageIndex is out of date – refers to unknown group {$groups[0]}. ";
179                         $warning .= 'Doing a rebuild.';
180                         wfWarn( $warning );
181                         MessageIndexRebuildJob::newJob()->run();
182 
183                         return false;
184                 }

I can't tell why the MessageIndexRebuildJob::newJob() instance must run immediately in isValid()

This doesn't have to be immediate, it can also return false and just insert the job into the queue. However, this code path should not be triggered in normal usage. I believe WMF's requirement to delay MessageIndexRebuildJob in the general cause makes this code path more common and probable breaks the expectations of TranslateMovePage – I would look into that direction for fixing this instance of the issue.

More generally, hitting this code path from out sources indicates there is either a delay executing the rebuild job or bug that does not trigger rebuild at all. Both have user visible impact. I was under the assumption that these jobs are processed with high priority. There are quite a few bugs reported about CentralNotice that are pending investigation. In fact this could be a cause for the slowdowns reported. Is there profiling information available for how long the rebuild job takes and what it consists of?

The message index code could do for a large amount of rework.

I am sure a lot of code could. I'm open to prioritizing actionable suggestions.

For the move page issue... TranslatablePageMoveJob sets PageTranslationHooks::$jobQueueRunning = true; which should prevent PageTranslationHooks::onMoveTranslationUnits from doing anything. I am having hard time to understand the stacktrace as there is no single line from TranslatablePageMoveJob there, but I'll try. DeferredUpdates::doUpdates(); is called after $job->run() has finished successfully. This means all the moves have been done and logged. I assume that MovePage has inserted some deferred updates that include running the hook which PageTranslationHooks::onMoveTranslationUnits listens to, but PageTranslationHooks::$jobQueueRunning = false; by then and the message index is out of date because of the moves.

We could insert forced immediate MessageIndex rebuild after MessageGroups::singleton()->recache(); in TranslatablePageMoveJob (but I don't know if it would trigger the same error, just in a different place) to make sure the index is up to date and avoid running it inside the hook call. This might fix the immediate issue, but the real issue is that the static variable in the class is not working as expected. It should be true when the hooks are called, but that is not possible because of the deferred updated as far as I can see. The idea is to avoid issues hundred edits to each of a hundred pages when moving a page with hundred translation units. Given the hook call is now deferred, all the hundred edits of each page would be the same. Is the overhead of doing possibly a lot of identical edits okay? These could possibly be done in a job that allows de-duplication. If so, we can remove the static variable altogether. If not, how would we make sure that the PageTranslationHooks::onMoveTranslationUnits is not called for the moves done in TranslatablePageMoveJob?

Change 425484 had a related patch set uploaded (by Nikerabbit; owner: Nikerabbit):
[mediawiki/extensions/Translate@master] Update message index in TranslatablePageMoveJob

https://gerrit.wikimedia.org/r/425484

@Nikerabbit I believe the following happens:

  • There is no line number from TranslatablePageMoveJob because the call stack starts from a deferred update after Job::run() returns.
  • Part of TranslatablePageMoveJob execution is new MovePage( $sourceTitle, $targetTitle )->move(). (TranslatablePageMoveJob.php#L93-L94)
  • When TranslatablePageMoveJob->run() returns in Jobrunner::executeJob(), the next thing is DeferredUpdates::doUpdates().
  • The AtomicSectionUpdate scheduled by MovePage::move() runs.
    • This update runs TitleMoveComplete hooks, which includes PageTranslationHooks::onMoveTranslationUnits. (TranslateHooks.php)

I'd recommend maybe using defined( 'MEDIAWIKI_JOB_RUNNER' ) instead for this boolean check. The only scenario in which jobs run outside a job runner process is stock MW installs without a job runner setup, e.g. where jobs run at the end of web requests via embedded Special:RunJobs).

@Nikerabbit I've landed it as-is for now, but let me know if using defined( 'MEDIAWIKI_JOB_RUNNER' ) works for you. If it does, then we can undo this patch in favour of that, which would help to keep your code simple :)

Change 425484 merged by jenkins-bot:
[mediawiki/extensions/Translate@master] Update message index in TranslatablePageMoveJob

https://gerrit.wikimedia.org/r/425484

We shouldn't revert my patch: like I said there it is the correct thing to do. Depending on where the constant is defined, it might break tests. I'll try and see.

Change 427609 had a related patch set uploaded (by Nikerabbit; owner: Nikerabbit):
[mediawiki/extensions/Translate@master] Use better heuristics to skip executing hook for page moves

https://gerrit.wikimedia.org/r/427609

Nikerabbit moved this task from Backlog to Scheduled on the Language-Team board.
Nikerabbit moved this task from Backlog to In Progress on the Language-2018-Apr-June board.

Change 427609 merged by jenkins-bot:
[mediawiki/extensions/Translate@master] Use better heuristics to skip executing hook for page moves

https://gerrit.wikimedia.org/r/427609

mmodell changed the subtype of this task from "Task" to "Production Error".Aug 28 2019, 11:10 PM