Page MenuHomePhabricator

Delete action throws a DBUnexpectedError with "MWCallableUpdate::doUpdate: Cannot flush snapshot because writes are pending (JobQueueDB::doBatchPush)"
Closed, ResolvedPublic

Description

MediaWiki 1.29.0-alpha (54381ead)
PHP	5.6.8 (apache2handler)
MySQL	5.6.24
ICU	54.1

Issue

A delete action TestDelete&action=delete to article "TestDelete" throws a:

User: loading options for user 1 from database.
OutputPage::sendCacheControl: private caching;  **
[exception] [11f4441e5372dd70d3d37066] /mw-master/index.php?title=TestDelete&action=delete   DBUnexpectedError from line 2851 of ...\mw-master\includes\libs\rdbms\database\Database.php: MWCallableUpdate::doUpdate: Cannot flush snapshot because writes are pending (JobQueueDB::doBatchPush).
#0 ...\mw-master\includes\libs\rdbms\loadbalancer\LoadBalancer.php(1101): Database->flushSnapshot(string)
#1 [internal function]: LoadBalancer->{closure}(DatabaseMysqli)
#2 ...\mw-master\includes\libs\rdbms\loadbalancer\LoadBalancer.php(1399): call_user_func_array(Closure, array)
#3 ...\mw-master\includes\libs\rdbms\loadbalancer\LoadBalancer.php(1109): LoadBalancer->forEachOpenMasterConnection(Closure)
#4 [internal function]: LoadBalancer->beginMasterChanges(string)
#5 ...\mw-master\includes\libs\rdbms\lbfactory\LBFactory.php(176): call_user_func_array(array, array)
#6 [internal function]: LBFactory->{closure}(LoadBalancer, string, array)
#7 ...\mw-master\includes\libs\rdbms\lbfactory\LBFactorySimple.php(147): call_user_func_array(Closure, array)
#8 ...\mw-master\includes\libs\rdbms\lbfactory\LBFactory.php(179): LBFactorySimple->forEachLB(Closure, array)
#9 ...\mw-master\includes\libs\rdbms\lbfactory\LBFactory.php(200): LBFactory->forEachLBCallMethod(string, array)
#10 ...\mw-master\includes\deferred\DeferredUpdates.php(262): LBFactory->beginMasterChanges(string)
#11 ...\mw-master\includes\deferred\DeferredUpdates.php(225): DeferredUpdates::runUpdate(MWCallableUpdate, LBFactorySimple, integer)
#12 ...\mw-master\includes\deferred\DeferredUpdates.php(133): DeferredUpdates::execute(array, string, integer)
#13 ...\mw-master\includes\MediaWiki.php(891): DeferredUpdates::doUpdates(string)
#14 ...\mw-master\includes\MediaWiki.php(720): MediaWiki->restInPeace(string)
#15 ...\mw-master\includes\MediaWiki.php(739): MediaWiki->{closure}()
#16 ...\mw-master\includes\MediaWiki.php(543): MediaWiki->doPostOutputShutdown(string)
#17 ...\mw-master\index.php(43): MediaWiki->run()
#18 {main}
[Bug56269] Exception thrown with an uncommited database transaction: [11f4441e5372dd70d3d37066] /mw-master/index.php?title=TestDelete&action=delete   DBExpectedError from line 1112 of ...\mw-master\includes\libs\rdbms\loadbalancer\LoadBalancer.php: MWCallableUpdate::doUpdate: Flush failed on server(s) localhost: MWCallableUpdate::doUpdate: Cannot flush snapshot because writes are pending (JobQueueDB::doBatchPush).
#0 [internal function]: LoadBalancer->beginMasterChanges(string)
#1 ...\mw-master\includes\libs\rdbms\lbfactory\LBFactory.php(176): call_user_func_array(array, array)
#2 [internal function]: LBFactory->{closure}(LoadBalancer, string, array)
#3 ...\mw-master\includes\libs\rdbms\lbfactory\LBFactorySimple.php(147): call_user_func_array(Closure, array)
#4 ...\mw-master\includes\libs\rdbms\lbfactory\LBFactory.php(179): LBFactorySimple->forEachLB(Closure, array)
#5 ...\mw-master\includes\libs\rdbms\lbfactory\LBFactory.php(200): LBFactory->forEachLBCallMethod(string, array)
#6 ...\mw-master\includes\deferred\DeferredUpdates.php(262): LBFactory->beginMasterChanges(string)
#7 ...\mw-master\includes\deferred\DeferredUpdates.php(225): DeferredUpdates::runUpdate(MWCallableUpdate, LBFactorySimple, integer)
#8 ...\mw-master\includes\deferred\DeferredUpdates.php(133): DeferredUpdates::execute(array, string, integer)
#9 ...\mw-master\includes\MediaWiki.php(891): DeferredUpdates::doUpdates(string)
#10 ...\mw-master\includes\MediaWiki.php(720): MediaWiki->restInPeace(string)
#11 ...\mw-master\includes\MediaWiki.php(739): MediaWiki->{closure}()
#12 ...\mw-master\includes\MediaWiki.php(543): MediaWiki->doPostOutputShutdown(string)
#13 ...\mw-master\index.php(43): MediaWiki->run()
#14 {main}

SUMMARY OF T165714, T100085, and T154425

All three bugs are related to jobs (background tasks managed asynchronously) and deferred updates (also called deferrable updates, tasks postponed to the end of the request). Some deferred updates can be transformed to jobs. Jobs are usually pushed in the job queue during normal operations, but some jobs are only pushed at a latter time, these are called lazy jobs (short for lazily-pushed jobs). At the end of a MediaWiki request on the Web, deferred updates are executed, then some jobs are executed if the parameter $wgJobRunRate is not 0, and lazy jobs are pushed to the job queue.

  • T165714 (duplicates: T154439) (affects: MW 1.28, 1.29, 1.30) Some jobs throw an exception when executed on the Web due to interactions between jobs and deferred updates linked to database transactions; these jobs are never executed and remain in the job queue until deleted. Bug trigerred when $wgJobRunRate > 0 (default) and the job queue is JobQueueDB (default). Fixed by https://gerrit.wikimedia.org/r/#/c/356120/ on 1.28, 1.29, 1.30.
  • T100085 (duplicates: T154427) (affects: MW: 1.27, 1.28, 1.29, 1.30) Some lazy jobs added by jobs executed on the Web are not pushed and this triggers an error in the logs; these jobs are never added in the job queue, hence never executed. Bug trigerred when $wgJobRunRate > 0 (default). Fixed by https://gerrit.wikimedia.org/r/#/c/356120/ on 1.27, 1.28, 1.29, 1.30. The fix from T154425 improves the resolution. Possibly other root causes given it happened on Translatewiki and Wikimedia (where $wgJobRunRate is 0).

Related Objects

View Standalone Graph
This task is connected to more than 200 other tasks. Only direct parents and subtasks are shown here. Use View Standalone Graph to show more of the graph.

Event Timeline

There are a very large number of changes, so older changes are hidden. Show Older Changes
Paladox added a subscriber: Paladox.Jan 2 2017, 4:17 PM

Given the following message [0], are we going to address (which implicitly means solve) this with 1.29?

[0] https://lists.wikimedia.org/pipermail/wikitech-l/2017-April/088019.html

If you or anyone else plan to provide a patch in Gerrit (see Developer access and the Gerrit tutorial) for review, there is a good chance.

Extension ReplaceText doesn't work under MW 1.28.2. Jobs are added to the queue, however they are never executed. Even running runJobs.php has no effects. Is this the same problem? No chance to get it fixed in MW 1.28?

I'm not sure whether this may be of help. When jobs are present in the queue, the error log shows this message every time a page is displayed:

PHP Deprecated: Automatically populating $HTTP_RAW_POST_DATA is deprecated and will be removed in a future version. To avoid this warning set 'always_populate_raw_post_data' to '-1' in php.ini and use the php://input stream instead. in Unknown on line 0

Aftering having deleted all jobs from the job table the above message disappers.

My error log was just flooded with almost a thousand lines just because I deleted a handful of pages. 100% reproducible for me on master.

I just noticed this also makes cleaning up spam very difficult. Since some jobs don't get run, the entries remain in the recent changes list. Nuke does not realize that some pages have already been deleted so it keeps offering the same pages. The recent changes list also keeps showing pages that have already been deleted as new pages.

Since some jobs don't get run, the entries remain in the recent changes list. Nuke does not realize that some pages have already been deleted so it keeps offering the same pages.

This also happened to me with some mass deletion, where some 20 % of the pages became effectively impossible to delete.

Seb35 added a subscriber: Seb35.May 30 2017, 8:23 AM

I dived into this/these issues after issues with Translate.

The primary issues I experienced with Translate jobs, and I guess issues @Choralia experienced are similar: jobs are inserted but never executed: this is issue T165714 and I was able to solve it with https://gerrit.wikimedia.org/r/#/c/356120/. More precisely, jobs are executed (by the the Web shutdown mechanism, see MediaWiki::doPostOutputShutdown), but because we are in this shutdown place (more precisely because DeferredUpdates::$immediateMode is true) DeferredUpdates are immediately executed – which is not expected by jobs – and, when jobs have already started a DB transaction (a call to LBFactory::beginMasterChanges), the call to LBFactory::beginMasterChanges in DeferredUpdates::runUpdate is triggering the exception DBUnexpectedError. (see backtraces in T165714)

The issue in this task and in T157679 and T154438 have the similar behaviour (DBUnexpectedError “uncommited database transaction”), but it seems different and less critical. After search the exact anonymous function triggering this issue in DeferredUpdates (anonymous functions are… anonymous, not easy to find, but perhaps there exist more powerful debug tools than my var_dump and the like), I finally found the Pingback class. When I comment its DeferredUpdates::addCallableUpdate, this DBUnexpectedError does no more appear. For now I didn’t dive more in this, I’m not sure if the issue is with Pingback or more generally with DeferredUpdates and its management of the DB transactions. Note that the WMF probably does not experience this issue because job queues are managed with Redis and not with the MySQL DB.

demon triaged this task as Unbreak Now! priority.May 31 2017, 6:39 PM
Restricted Application added subscribers: Jay8g, TerraCodes. · View Herald TranscriptMay 31 2017, 6:39 PM
Seb35 added a comment.Jun 1 2017, 2:05 PM

Well, I’m quite confident in my understanding of this issue. If I’m correct, it happens with some deferred updates, namely EnqueueableDataUpdate – which can be converted to jobs – and it happens with the action "delete", "restore", and "move" because LinksDeletionUpdate and LinksUpdate implement EnqueueableDataUpdate. An other necessary condition is that another deferred updates, *not* an EnqueueableDataUpdate is exectuted at the same time.

  1. In the post-connection shutdown (MediaWiki::restInPeace), deferred updates are executed by calling DeferredUpdates::doUpdates( 'enqueue' ), this is unconditionnal with respect to Web execution of jobs, hence unrelated to tasks T165714 even if the error messages are very similar.
  2. Deferred updates are executed in DeferredUpdates::execute. Here because of mode 'enqueue', DeferredUpdates::enqueueUpdates is called, and the whole stack for pushing jobs is called: JobQueueGroup::push, JobQueue::push, JobQueue::batchPush, JobQueueDB::doBatchPush
  3. in this last function there is $dbw->onTransactionIdle: callback to JobQueueDB::doBatchPushInternal, hence this callback is registered in array Database::mTrxIdleCallbacks.
  4. DeferredUpdates::enqueueUpdates remove the EnqueueableDataUpdate, which is fine, because they will be added the transaction will become idle.
  5. Execution flow continues in DeferredUpdates::execute, and *because there are other deferred updates* (for instance Pingback, and it explains because when I deactivated the Pingback deferred update, there is no more issue), the loop $updatesByType is called, since non-empty
  6. DeferredUpdates::runUpdate is called on the other deferred update, then $lbFactory->beginMasterChanges(), then the stack from the backtrace in this bug description: LBFactory::beginMasterChanges, ..., LoadBalancer::beginMasterChanges (which explains the part of the message "Flush failed on server(s)"), Database::flushSnapshot (which explains the part of the message "Cannot flush snapshot because writes are pending")

QED — if I’m not mistaken

Change 356619 had a related patch set uploaded (by Seb35; owner: Seb35):
[mediawiki/core@master] Use AtomicSectionUpdate instead of Database->onTransactionIdle

https://gerrit.wikimedia.org/r/356619

Seb35 claimed this task.Jun 1 2017, 4:08 PM

The patch doesn't work for me: jobs remain in the queue. The following warning is still reported in the error log:

PHP Deprecated: Automatically populating $HTTP_RAW_POST_DATA is deprecated and will be removed in a future version. To avoid this warning set 'always_populate_raw_post_data' to '-1' in php.ini and use the php://input stream instead. in Unknown on line 0

Seb35 added a comment.Jun 2 2017, 9:13 AM

@Choralia the "PHP Deprecated" is probably unrelated to this specific bug, but possibly comes from a specific job from an extension. And it’s very subtle, but this specific bug is related to a specific job type (called deferred updates) which are not executed but deferred updates are not managed by the job queue, the bug about jobs remaining in the job queue is T165714, which is now fixed in the master version of MediaWiki. If you want to test the patch of the other bug, it should be with the master version (1.30.0-alpha) and it should be included also in 1.29.0 (probably in 1.29.0-rc.1).

https://gerrit.wikimedia.org/r/356619

I did a quick test with the PR and it looks promising which also seems to address T154438.

mwjames added a comment.EditedJun 3 2017, 8:24 AM

I did a quick test with the PR and it looks promising which also seems to address

Something doesn't fit here, I'm losing (meaning there are never inserted) jobs on those that are `lazypush.

I set a log point and I can see that JobQueueGroup::lazyPush and push are reached but it never gets queued or inserted into the job table.

WikiPage::doEditUpdates: Using prepared edit...
Saved in parser cache with key mw-master:pcache:idhash:17651-0!*!0!*!en!*!* and 

[jobqueue] JobQueueGroup::lazyPush O:39:"SMW\MediaWiki\Jobs\ChangePropagationJob":10:{s:20:" * isEnabledJobQueue";b:1;s:7:" * jobs";a:0:{}s:8:" * store";N;s:7:"command";s:24:"SMW\ChangePropagationJob";s:6:"params";a:1:{s:9:"requestId";s:24:"fb0cc230e2d63916e2a80e42";}s:8:"metadata";a:0:{}s:8:" * title";O:5:"Title":7:{s:10:"mNamespace";i:102;s:10:"mDbkeyform";s:10:"ChangeProp";s:9:"mFragment";s:0:"";s:10:"mInterwiki";s:0:"";s:22:" Title mLocalInterwiki";b:0;s:17:" * mUserCaseDBKey";s:10:"ChangeProp";s:17:"mDefaultNamespace";i:0;}s:19:" * removeDuplicates";b:0;s:8:" * error";N;s:20:" * teardownCallbacks";a:0:{}}

...

MediaWiki::preOutputCommit: LBFactory shutdown completed

...

OutputPage::sendCacheControl: private caching;  **
[objectcache] Duplicate get(): "mw-master:prepared-edit:e5aa5f27ed5db7855aa7748ca796a134:7f78a645c6e02815fb2d98b4325e30b6b0878903:90f42dc8c99187877261a42fd0a194b3" fetched 2 times
[jobqueue] JobQueueGroup::push a:1:{i:0;O:39:"SMW\MediaWiki\Jobs\ChangePropagationJob":10:{s:20:" * isEnabledJobQueue";b:1;s:7:" * jobs";a:0:{}s:8:" * store";N;s:7:"command";s:24:"SMW\ChangePropagationJob";s:6:"params";a:1:{s:9:"requestId";s:24:"fb0cc230e2d63916e2a80e42";}s:8:"metadata";a:0:{}s:8:" * title";O:5:"Title":7:{s:10:"mNamespace";i:102;s:10:"mDbkeyform";s:10:"ChangeProp";s:9:"mFragment";s:0:"";s:10:"mInterwiki";s:0:"";s:22:" Title mLocalInterwiki";b:0;s:17:" * mUserCaseDBKey";s:10:"ChangeProp";s:17:"mDefaultNamespace";i:0;}s:19:" * removeDuplicates";b:0;s:8:" * error";N;s:20:" * teardownCallbacks";a:0:{}}}
Request ended normally

After "Request ended normally" no queue entry nor any async run.

Seb35 added a comment.Jun 3 2017, 10:01 AM
  1. Just to be sure: you have both the gerrit patches https://gerrit.wikimedia.org/r/#/c/356120/ (now in master, REL1_29 and REL1_28 branches) and https://gerrit.wikimedia.org/r/#/c/356619/ ? Un-pushed lazy jobs are more specifically solved by https://gerrit.wikimedia.org/r/#/c/356120/ (T100085).
  2. What is your value of $wgJobRunRate ?
  3. How can you be sure they are never inserted in the job queue? To test this sort of things, I tend to set $wgJobRunRate to 0 (so that Web does not execute any jobs), wait at least 30 seconds, and run php maintenance/showJobs.php --group (or --list). The 30 seconds are mandatory because JobQueueDB has a cache of 30 seconds (JobQueueDB::CACHE_TTL_SHORT) -- sometimes I edit this constant to (e.g.) 3 to reduce the delays.

Just to be sure: you have both the gerrit patches

Yes.

The reference PR in this issue on top of https://github.com/wikimedia/mediawiki/commit/13ffedf7278942d810514cb88da54fff97b23617.

What is your value of $wgJobRunRate ?

Using the default value.

How can you be sure they are never inserted in the job queue?

Because I'm watching the job table and the debug log which either should have an entry or if it matches the async process runs directly but in both cases would create a log [runJobs] ... STARTING record.

The more worrying factor is that I executed the same process and now jobs are inserted which seems to indicate some non deterministic (or better opportunistic) behaviour.

The more worrying factor is that I executed the same process and now jobs are inserted which seems to indicate some non deterministic (or better opportunistic) behaviour.

Never mind my last comment, I actually reverted to the onTransactionIdle in doBatchPush to compare the results which showed the expected behaviour of jobs being inserted and executed but as soon as I switch to the proposed AutoCommitUpdate handler, jobs just vanish.

I'm struggling with the patches. Updating MediaWiki.php allows tasks in the job queue to be executed, however the following error is diplayed and logged:

Fatal error: Class 'Wikimedia\Rdbms\LBFactory' not found in /home/test/public_html/wiki/includes/MediaWiki.php on line 921

The extension ReplaceText, which I use to create jobs, also triggers the following error:

Fatal error: Class 'Wikimedia\Rdbms\ChronologyProtector' not found in /home/test/public_html/wiki/includes/MediaWiki.php on line 641

I guess that there is something else to be updated, which I haven't been able to identify. I'm not a programmer, and I'm not familiar with these tools to maintain php code. Sorry if this is of no help.

Seb35 added a comment.Jun 3 2017, 2:26 PM

The more worrying factor is that I executed the same process and now jobs are inserted which seems to indicate some non deterministic (or better opportunistic) behaviour.

Never mind my last comment, I actually reverted to the onTransactionIdle in doBatchPush to compare the results which showed the expected behaviour of jobs being inserted and executed but as soon as I switch to the proposed AutoCommitUpdate handler, jobs just vanish.

@mwjames But when you have the onTransactionIdle version, you do have the "Flush failed on server(s)" exception, isn’t it? Also, when you do apply the two patches, can you give the output of php maintenance/showJobs.php --group, I would like to see what jobs are concerned and what is their execution state (if they are "claimed" and if there are no request through the Web, it could indicate the jobs have failed due to an internal exception).

@mwjames But when you have the onTransactionIdle version, you do have the "Flush failed on server(s)" exception, isn’t it?

Well, I'm testing the patch and while the issue of "Flush failed on server(s)" doesn't appear on the delete or move action, at the same time I have encountered above observation which relates to the proposed patch hence my report about the findings.

can you give the output of can you give the output of php maintenance/showJobs.php --group

I cannot because there are none that's the crux.

mwjames added a comment.EditedJun 3 2017, 3:08 PM

A minimal test case would include:

At the point of:

...
MediaWiki::preOutputCommit: primary transaction round committed
WikiPage::doEditUpdates: Using prepared edit...
Saved in parser cache with key mw-master:pcache:idhash:17651-0!*!0!*!en!*!* and timestamp 20170603150110 and revision id 22520
Title::getRestrictionTypes: applicable restrictions to [[Property:ChangeProp]] are {edit,move}
...

you post a job during SecondaryUpdate or as DeferrableUpdate by simply JobQueueGroup::singleton()->lazyPush( 'MyFakeJob' ), now sooner or later MyFakeJob should appear either as queued job or if your lucky is already scheduled. The problem is that this never happens when one relies on AutoCommitUpdate in doBatchPush.

Seb35 added a comment.Jun 3 2017, 3:29 PM

So it would suggest the issue is in JobQueue::push(). If you set $wgJobRunRate to 0, do you see the job(s) with showJobs.php?

Seb35 added a comment.Jun 3 2017, 3:49 PM

From what I understand in this issue, when you have an EnqueuableDataUpdate (e.g. when you delete a page) and a classical DeferrableUpdate (e.g. Pingback, which is always added):

In the original scenario with onTransactionIdle, the EnqueableDataUpdate(s) is/are added as a callback in DeferredUpdates::execute() (and this callback should be called at a latter time by the Database) then the classical DeferrableUpdate is triggering the warning from this bug description and is not executed (and next deferred updates are not executed and probably the EnqueableDataUpdate(s) is not executed); and when there is no EnqueueableDataUpdate the DeferredUpdates are executed, then JobQueueGroup::pushLazyJobs() (this last step with the patch from T100085) (all this either in MediaWiki::restInPeace() either in JobRunner::execute()).

In the proposed patch with AutoCommitUpdate, the EnqueableDataUpdate(s) are added in the sub-queue of DeferredUpdates (and should be called in a latter time by DeferredUpdates::execute()) then the classical DeferrableUpdate(s) are executed, then the EnqueueableDataUpdate(s) are executed, then JobQueueGroup::pushLazyJobs() is executed (this last step since the patch from T100085) (all this either in MediaWiki::restInPeace() either in JobRunner::execute()). If there is no EnqueueableDataUpdate, this step is not present, but other DeferrableUpdate(s) are executed and JobQueueGroup::pushLazyJobs() is executed.

This is what I understand, but it seems I forget a detail, possibly only in the case when this is no EnqueueableDataUpdate, or possibly a detail in JobQueueGroup::push().

Seb35 added a comment.Jun 3 2017, 4:26 PM

Ok, I am able to reproduce the issue you are observing @mwjames, I investigate.

Seb35 added a comment.Jun 3 2017, 5:11 PM

Ok so when a problem is solved another appears. We are in the shutdown, hence nothing is like normal operation. What happens with the proposed change is to add the lazy jobs as a deferrable update and not with onTransactionIdle (given it’s not possible due to Database mechanisms). So in the shutdown (either MediaWiki::restInPeace() either JobRunner::executeJob()), you execute the deferred updates, you add lazy jobs, and since the lazy jobs are added through a deferrable update, nothing more happens since deferred updates are already executed.

A hacky solution is to add one more call to DeferredUpdates::doUpdates() to execute the last AutoCommitUpdate from JobQueueGroup::pushLazyJobs(). I push the gerrit patch (which solves the issue you are speaking @mwjames) but imho it is a bad solution.

Another solution would be what @aaron proposed: convert JobQueueGroup::lazyPush() to a DeferrableUpdate, hence, theoretically, JobQueueGroup::pushLazyJobs() can be removed and lazy jobs are unified with deferrable updates => if a deferrable update lazy-push a job, it will add it in the sub-queue and will be executed. It would remove the chicken-and-egg issue between DeferredUpdates and pushLazyJobs().

Change 356619 merged by jenkins-bot:
[mediawiki/core@master] Use AutoCommitUpdate instead of Database->onTransactionIdle

https://gerrit.wikimedia.org/r/356619

Seb35 added a comment.Jun 7 2017, 7:01 AM

The patch should be backported to REL1_29 and REL1_28. Should be checked for REL1_27.

Change 357560 had a related patch set uploaded (by Seb35; owner: Seb35):
[mediawiki/core@REL1_29] Use AutoCommitUpdate instead of Database->onTransactionIdle

https://gerrit.wikimedia.org/r/357560

Change 357561 had a related patch set uploaded (by Seb35; owner: Seb35):
[mediawiki/core@REL1_28] Use AutoCommitUpdate instead of Database->onTransactionIdle

https://gerrit.wikimedia.org/r/357561

Seb35 updated the task description. (Show Details)Jun 7 2017, 7:51 AM

Ok, I am able to reproduce the issue you are observing @mwjames, I investigate.

As for the described T154425#3312680 case, the observation no longer occurs with [0] and a delete action also no longer throws a "throws a DBUnexpectedError with "MWCallableUpdate::doUpdate ...".

[0] https://github.com/wikimedia/mediawiki/commit/7f65646f35fca0544e6fc388f884bfedcb9a8948

mwjames added a comment.EditedJun 7 2017, 12:58 PM

As for the described T154425#3312680 case, the observation no longer occurs with

Okay, I spoke too early. Let's take the pseudo FakeDispatchJob as example.

  • T154425#3312680 does not occur when jobs are scheduled using a web request (via Special:RunJobs)
  • When scheduled and executed via runJobs.php and during that process FakeDispatchJob tries to post follow-up jobs either as push or lazyPush then those jobs are lost and never appear in the queue. (PS: This worked before the mentioned 7f65646f3 merge.)

Some pseudo code:

class SecondaryUpdateJob extends Job {

	/**
	 * @see Job::run
	 */
	public function run() {

		// Do something ...

		JobQueueGroup::singleton()->push(
			new FakeDispatchJob( Title::newFromText( 'Foo' ) )
		)

		return true;
	}

}

class FakeDispatchJob extends Job {

	/**
	 * @see Job::run
	 */
	public function run() {

		// Do something ...

		// Push another job on some condition during FakeDispatchJob
		JobQueueGroup::singleton()->push(
			new SpreadFakeJob( Title::newFromText( 'Bar' ) )
		)

		return true;
	}

}

@Seb35 Can you confirm my observation?

Krinkle added a subscriber: Krinkle.Jun 7 2017, 8:20 PM

Change 356619 merged by jenkins-bot:
[mediawiki/core@master] Use AutoCommitUpdate instead of Database->onTransactionIdle
https://gerrit.wikimedia.org/r/356619

Since this was merged, can you @mwjames @Seb35 verify and/or clarify what is left here?

Seb35 added a comment.Jun 8 2017, 9:42 AM

@mwjames @Krinkle I am able to confirm the observation (sadly): the FakeDispatchJob, added by SecondaryUpdateJob, is not added in the job queue when the job queue is executed on CLI (but works when executed on Web).

It impacts DB-managed queues, but I’m not sure if it would impact Redis-managed queues; I tend to think it would not given the AutoCommitUpdate is DB-specific, but I’m not sure given there is also a change in the call to JobQueueGroup::pushLazyJobs (although this specific part is changed from a direct call to a deferred update which should be called by the DeferredUpdates::doUpdates, hence it shouldn’t be very different, at least for a non-DB job queue implementation).

But if WMF/Wikipedia is not confortable with this situation, perhaps it is wiser to revert the patch. (Perhaps Beta cluster logs can help.) (I cannot test Redis queues myself.)

I will continue to search and study in the next few hours.

Seb35 added a comment.Jun 8 2017, 11:58 AM

I localised the issue in DeferredUpdates::tryOpportunisticExecute() [which is pretty expected given it’s a difference between CLI and Web execution of jobs] => the callback is directly executed, but still don’t know why it does not add the job to the job queue.

Seb35 added a comment.Jun 8 2017, 12:55 PM

I pretty sure there is no issue with Redis-managed queues.

It’s a "database transaction" issue, help wanted @aaron @Krinkle.

Commented stack:

  • we are in JobRunner in CLI,
  • a job is executed like the "SecondaryUpdateJob" from @mwjames
  • this job adds another job,
  • "push job" is executed,
  • the AutoCommitUpdate is added,
  • directly executed through DeferredUpdates::tryOpportunisticExecute since we are in CLI,
  • DeferredUpdates::getBusyDbConnections is false (I didn’t checked it, but perhaps the transaction status should be checked here and return false in this case)
  • DeferredUpdates::doUpdates and DeferredUpdates::execute [the POSTSEND queue],
  • DeferredUpdates::runUpdate for our AutoCommitUpdate
  • LBFactory::beginMasterChanges triggers silently an exception because the transaction of the current job is already active

Wikimedia\Rdbms\DBTransactionError: AutoCommitUpdate::doUpdate: transaction round 'SecondaryUpdateJob::run' already started. in /mediawiki/includes/libs/rdbms/lbfactory/LBFactory.php on line 200

Seb35 added a comment.Jun 8 2017, 2:58 PM

I see some resolution path:

  • deactivate DeferredUpdates::tryOpportunisticExecute (it’s quite brutal)
  • add a check to the transaction status in DeferredUpdates::getBusyDbConnections
  • lazy push all jobs in CLI to defer their pushing at the end of the job execution (although it should be carefully designed because pushLazyJobs will push the jobs, and an indefinite loop could be created)

Change 357561 merged by jenkins-bot:
[mediawiki/core@REL1_28] Use AutoCommitUpdate instead of Database->onTransactionIdle

https://gerrit.wikimedia.org/r/357561

Change 357560 merged by jenkins-bot:
[mediawiki/core@REL1_29] Use AutoCommitUpdate instead of Database->onTransactionIdle

https://gerrit.wikimedia.org/r/357560

aaron added a comment.Jun 9 2017, 2:58 AM

The regression is due to a pre-existing bug being exposed with transaction rounds (distinct from individual DB transactions) not being seen in getBusyDbConnections().

aaron@aaron-HP-HDX18-Notebook-PC:~/OSS/core$ php maintenance/eval.php 
> $j=new NullJob( Title::newMainPage(), [ 'lives' => 100 ] ); JobQueueGroup::singleton()->push( $j );

> 
aaron@aaron-HP-HDX18-Notebook-PC:~/OSS/core$ php maintenance/runJobs.php

The exception log (if defined) will show transaction nesting errors.

Change 357966 had a related patch set uploaded (by Aaron Schulz; owner: Aaron Schulz):
[mediawiki/core@master] Make DeferredUpdates detect LBFactory transaction rounds

https://gerrit.wikimedia.org/r/357966

I am able to confirm the observation (sadly)

@Seb35 Thanks for your time and effort.

https://gerrit.wikimedia.org/r/357966

I can confirm that the presented PR solves the issue as noted in T154425#3323418 with the original issue being addressed by https://gerrit.wikimedia.org/r/#/c/356619/, yet I feel some uneasiness given the T154425#3334588 comment which may hint other uncovered issues.

The regression is due to a pre-existing bug being exposed with transaction rounds ...
... will show transaction nesting errors.

This is rather unsettling and just came to light while testing something entirely different on our end. I would expect that MediaWiki-core provides its own integration test to avoid a similar situation in future where DeferredUpdates is altered again and pending processes stop working.

Seb35 added a comment.Jun 9 2017, 7:46 AM

Closed the quasi-duplicates T154438, T157679, T166867, T153849 (same call stack but different user-end actions).

Seb35 added a comment.Jun 9 2017, 8:55 AM

I am able to confirm the observation (sadly)

@Seb35 Thanks for your time and effort.

It’s interesting for me to dive into MediaWiki core. Thanks to you @mwjames for your exacting testing. Thanks also @aaron for you guidance.

https://gerrit.wikimedia.org/r/357966

I can confirm that the presented PR solves the issue as noted in T154425#3323418 with the original issue being addressed by https://gerrit.wikimedia.org/r/#/c/356619/, yet I feel some uneasiness given the T154425#3334588 comment which may hint other uncovered issues.

I guess the other calls to onTransactionIdle would trigger this same error (and exception message). Perhaps it should call an AutoCommitUpdate, but I’ve no enough experience to have an informed advice, there are perhaps use cases where callbacks should not be deferred. But imho it should be discussed in a specific task, this one is almost solved.

The regression is due to a pre-existing bug being exposed with transaction rounds ...
... will show transaction nesting errors.

This is rather unsettling and just came to light while testing something entirely different on our end. I would expect that MediaWiki-core provides its own integration test to avoid a similar situation in future where DeferredUpdates is altered again and pending processes stop working.

Possibly yes, but writing good integration tests need to know what should be tested; the experience gained from these three tasks would be for sure interesting to write some integration tests. I’m not sure if there is a development policy in MediaWiki for integration tests; if no it would be quite an important change (and it would imply developement time used for integration tests instead of writing features, a developement political decision anyway with a much broader scope than these bugs).

Seb35 added a comment.Jun 9 2017, 9:14 AM

I link here some other tasks I found as partly linked to this one - if someone want to check in details and/or discuss these and/or close these:

  • T85762 "Get rid of the Uncommitted DB writes notice" A proposal about hiding these warnings in some cases
  • T85734 "Uncommitted DB writes (transaction from DatabaseBase::query (DatabaseBase::query))" Quite old, possibly solved in newer MediaWiki versions
  • T58269 "Uncommitted DB writes (transaction from DatabaseBase::query (Block::newLoad))" Quite old, possibly solved in newer MediaWiki versions
  • T52778 "nested transaction in update.php when updating collation" Quite old, possibly solved in newer MediaWiki versions
  • T132215 "Do not show recent changes for recently deleted pages" About the design to make as deferred updates for remove RC entries, parly solved (mitigated) by this fix (linked to T153849 "Deleted pages' creation entry shows up in Special:RecentChanges as a redlink since MW 1.28")
Kghbln added a subscriber: Kghbln.Jun 9 2017, 2:15 PM

Change 358172 had a related patch set uploaded (by Krinkle; owner: Aaron Schulz):
[mediawiki/core@REL1_29] Make DeferredUpdates detect LBFactory transaction rounds

https://gerrit.wikimedia.org/r/358172

Krinkle reassigned this task from Seb35 to aaron.Jun 10 2017, 3:22 PM
Krinkle moved this task from Usage problem to General library on the Mediawiki-Rdbms board.

Change 358172 merged by jenkins-bot:
[mediawiki/core@REL1_29] Make DeferredUpdates detect LBFactory transaction rounds

https://gerrit.wikimedia.org/r/358172

Change 357966 merged by jenkins-bot:
[mediawiki/core@master] Make DeferredUpdates detect LBFactory transaction rounds

https://gerrit.wikimedia.org/r/357966

Krinkle closed this task as Resolved.Jun 11 2017, 7:41 PM
Krinkle removed a project: MW-1.29-release-notes.

Change 358461 had a related patch set uploaded (by Seb35; owner: Aaron Schulz):
[mediawiki/core@REL1_28] Make DeferredUpdates detect LBFactory transaction rounds

https://gerrit.wikimedia.org/r/358461

Change 358461 merged by jenkins-bot:
[mediawiki/core@REL1_28] Make DeferredUpdates detect LBFactory transaction rounds

https://gerrit.wikimedia.org/r/358461

Seb35 updated the task description. (Show Details)Jun 13 2017, 5:16 PM