PHP Notice: JobQueueGroup::__destruct: 1 buffered job(s) never inserted
Closed, ResolvedPublic

Description

Cf. T99849. It's not clear what to do about such notices found in the logs, maybe a stacktrace or clearer message is needed.

-rakkaus:#mediawiki-i18n- [Sat May 23 15:39:43 2015] [hphp] [1080:7fb9987ff700:158:000001] [] \nNotice: JobQueueGroup::__destruct: 1 buffered job(s) never inserted. in /srv/mediawiki/tags/2015-05-23_13:35:45/includes/jobqueue/JobQueueGroup.php on line 419

See also:
T99849: Notice on edit via API: JobQueueGroup::__destruct: 1 buffered job(s) never inserted
T154427: ErrorException with "PHP Notice: JobQueueGroup::__destruct: 1 buffered job ... HTMLCacheUpdateJob never inserted"


SUMMARY OF T165714, T100085, and T154425

All three bugs are related to jobs (background tasks managed asynchronously) and deferred updates (also called deferrable updates, tasks postponed to the end of the request). Some deferred updates can be transformed to jobs. Jobs are usually pushed in the job queue during normal operations, but some jobs are only pushed at a latter time, these are called lazy jobs (short for lazily-pushed jobs). At the end of a MediaWiki request on the Web, deferred updates are executed, then some jobs are executed if the parameter $wgJobRunRate is not 0, and lazy jobs are pushed to the job queue.

  • T165714 (duplicates: T154439) (affects: MW 1.28, 1.29, 1.30) Some jobs throw an exception when executed on the Web due to interactions between jobs and deferred updates linked to database transactions; these jobs are never executed and remain in the job queue until deleted. Bug trigerred when $wgJobRunRate > 0 (default) and the job queue is JobQueueDB (default). Fixed by https://gerrit.wikimedia.org/r/#/c/356120/ on 1.28, 1.29, 1.30.
  • T100085 (duplicates: T154427) (affects: MW: 1.27, 1.28, 1.29, 1.30) Some lazy jobs added by jobs executed on the Web are not pushed and this triggers an error in the logs; these jobs are never added in the job queue, hence never executed. Bug trigerred when $wgJobRunRate > 0 (default). Fixed by https://gerrit.wikimedia.org/r/#/c/356120/ on 1.27, 1.28, 1.29, 1.30. The fix from T154425 improves the resolution. Possibly other root causes given it happened on Translatewiki and Wikimedia (where $wgJobRunRate is 0).

Related Objects

There are a very large number of changes, so older changes are hidden. Show Older Changes
ashley reopened this task as Open.Jan 13 2017, 3:37 AM
ashley added subscribers: lcawte, ashley.

Going to be bold and reopen this. Ever since we upgraded ShoutWiki to MW 1.28 last December we've seen plenty of these errors, which manifest as RC entries which shouldn't be there (T153849/T154425). Because the jobs don't get inserted, the RC rows don't get deleted...but the big question here is "why?". I've been trying to track down this issue ever since the upgrade, so any and all pointers are more than welcome.

Tuxxic added a subscriber: Tuxxic.Feb 10 2017, 3:48 PM
aaron removed aaron as the assignee of this task.Feb 17 2017, 9:30 PM

Another report in Project:Support Desk affecting the MassMessage extension: https://www.mediawiki.org/wiki/Topic:Tozjaee579yfxbqw

Gilles assigned this task to Krinkle.Apr 26 2017, 6:36 PM
Gilles moved this task from Inbox to Next-up on the Performance-Team board.

This was originally reported in 2015 about MediaWiki 1.25. The following patch was created in attempt to resolve it:
rMW016fd0b7762d: Refactored entry points to have uniform shutdown handling

Logstash: type=mediawiki "never inserted"

A few samples:

ErrorException from line 449 of /srv/mediawiki/php-1.29.0-wmf.20/includes/jobqueue/JobQueueGroup.php

  • PHP Notice: JobQueueGroup::__destruct: 5 buffered job(s) of type(s) CategoryMembershipChangeJob, CdnPurgeJob, RecentChangesUpdateJob, HTMLCacheUpdateJob never inserted.
    • POST sco.wikipedia /w/api.php
  • PHP Notice: JobQueueGroup::__destruct: 6 buffered job(s) of type(s) EnqueueJob, CategoryMembershipChangeJob, CdnPurgeJob, RecentChangesUpdateJob, HTMLCacheUpdateJob never inserted.
    • POST astwiki /w/api.php
  • PHP Notice: JobQueueGroup::__destruct: 5 buffered job(s) of type(s) CategoryMembershipChangeJob, CdnPurgeJob, RecentChangesUpdateJob, HTMLCacheUpdateJob never inserted.
    • POST wikidatawiki /w/api.php
  • PHP Notice: JobQueueGroup::__destruct: 6 buffered job(s) of type(s) CategoryMembershipChangeJob, CdnPurgeJob, RecentChangesUpdateJob, EnotifNotifyJob, HTMLCacheUpdateJob never inserted.
    • POST dewiki /w/index.php?title=Wikipedia%3ABeitragszahlen%2FArtikel-Bearbeitungen&action=submit
  • PHP Notice: JobQueueGroup::__destruct: 7 buffered job(s) of type(s) EnqueueJob, CategoryMembershipChangeJob, CdnPurgeJob, RecentChangesUpdateJob, EnotifNotifyJob, HTMLCacheUpdateJob never inserted.
    • POST dewiki /w/index.php?title=Wikipedia%3ABeitragszahlen&action=submit
Krinkle renamed this task from Unactionable Notice: JobQueueGroup::__destruct: 1 buffered job(s) never inserted to PHP Notice: JobQueueGroup::__destruct: 1 buffered job(s) never inserted.Apr 26 2017, 9:46 PM
Krinkle removed Krinkle as the assignee of this task.
Krinkle moved this task from Resolved to Production Impact on the Wikimedia-production-error board.
Krinkle added a subscriber: Krinkle.

Per previous comment by Ciencia Al Poder

Zppix added a subscriber: Zppix.May 16 2017, 7:32 PM

These were user-provided screenshots from mediawiki channel on IRC

These were user-provided screenshots from mediawiki channel on IRC

I am the taker of these screenshots, and I must mention I only started having this issue after upgrading from MW 1.26 to 1.28.2. The bug has prevented me from moving pages that have been marked for translation (through Extension:Translate), the site renaming the core page but freezing up at any of the subpages. I can verify that moving a page that has not been marked functions as one would expect. I have yet to experiment with runJobs.php, but I can provide whatever extra info I have that is necessary for a solution.

Legoktm added a subscriber: Legoktm.

This isn't being caused by MassMessage.

Here's how to reproduce this:

diff --git a/includes/specials/SpecialBlankpage.php b/includes/specials/SpecialBlankpage.php
index e61f12b..6cc50d0 100644
--- a/includes/specials/SpecialBlankpage.php
+++ b/includes/specials/SpecialBlankpage.php
@@ -35,5 +35,7 @@ class SpecialBlankpage extends UnlistedSpecialPage {
        public function execute( $par ) {
                $this->setHeaders();
                $this->getOutput()->addWikiMsg( 'intentionallyblankpage' );
+               JobQueueGroup::singleton()->lazyPush( new NullJob( Title::newMainPage(), [] ) );
+               throw new RuntimeException('ugh');
        }
 }

So basically if an exception is thrown so the MediaWiki::doPostOutputShutdown() call isn't reached, the jobs get discarded.

What is the expected behavior here? Try to submit the jobs anyways? Or treat it like a db rollback and discard the not-submitted jobs?

I would say the expected behavior is that no exceptions get thrown :D How can we find out what those exceptions are and fix them?

I would say the expected behavior is that no exceptions get thrown :D How can we find out what those exceptions are and fix them?

I'm not sure, shouldn't exceptions be going to the normal exception log?

I would say the expected behavior is that no exceptions get thrown :D How can we find out what those exceptions are and fix them?

I'm not sure, shouldn't exceptions be going to the normal exception log?

I assume they are, aren't they? The reason this notice ends up being logged isn't because the exception is non-fatal, but because it gets logged from a native shutdown or destruct handler, which can run even after the exception has been dealt with.

This bug should only happen if the exception is left uncaught until a very high level handler (probably OutputPage::output or MediaWiki.php or some such). At that point, we cannot decide to rollback everything (since it's not a DB exception, which we handle more explicitly), and also should not decide to insert them (since something clearly went wrong).

Aborting mid-way is the correct behaviour. Any DB consistency should already be taken care of by other logic. The notice represents the truth: A job was queued, but execution was aborted before it could be inserted with confidence.

The right solution is to either 1) Fix the exception, or 2) If the exception is genuine, handle it property by catching it. In addition, it may be appropiate to, if a try/catch is added, to move the job queue insertion to outside that block (before or after it).

Aaron and I (very) briefly discussed this on IRC yesterday:

[21:19:52] <wikibugs> Performance-Team, MediaWiki-JobQueue, MW-1.27-release, MW-1.29-release, Wikimedia-log-errors: PHP Notice: JobQueueGroup::__destruct: 1 buffered job(s) never inserted - https://phabricator.wikimedia.org/T100085#3285367 (Legoktm) Here's how to reproduce this: ```lang=diff diff --git a/includes/...
[21:21:25] <legoktm> AaronSchulz: thoughts ^ ?
[21:26:20] <AaronSchulz> looks like callers that want rollback() already use onTransactionIdle() and the method does not mention rollback.
[21:26:40] <AaronSchulz> I'd suggest making the exception handler try to push them and clarifying the docs for lazyPush().
[21:42:27] <AaronSchulz> legoktm: where you thinking of patching?
[21:43:30] <legoktm> I didn't get that far
[21:44:10] <legoktm> I spend the past couple of days trying to reproduce it and it was obvious the whole time :P
[21:45:02] <legoktm> relatedly, deferredupdates that get queued but don't run because of an exception also get silently dropped I assume?
[21:56:41] <AaronSchulz> they still run
[21:57:02] <AaronSchulz> though some of them (like AtomicSectionUpdate or CallableUpdate with a $dbw handle will cancel)

I think having the exception handler attempt to push them seems reasonable.

Change 355476 had a related patch set uploaded (by Aaron Schulz; owner: Aaron Schulz):
[mediawiki/core@master] Push lazy jobs when exceptions are handled by MWExceptionHandler

https://gerrit.wikimedia.org/r/355476

Here's how to reproduce this:

diff --git a/includes/specials/SpecialBlankpage.php b/includes/specials/SpecialBlankpage.php
index e61f12b..6cc50d0 100644
--- a/includes/specials/SpecialBlankpage.php
+++ b/includes/specials/SpecialBlankpage.php
@@ -35,5 +35,7 @@ class SpecialBlankpage extends UnlistedSpecialPage {
        public function execute( $par ) {
                $this->setHeaders();
                $this->getOutput()->addWikiMsg( 'intentionallyblankpage' );
+               JobQueueGroup::singleton()->lazyPush( new NullJob( Title::newMainPage(), [] ) );
+               throw new RuntimeException('ugh');
        }
 }

So basically if an exception is thrown so the MediaWiki::doPostOutputShutdown() call isn't reached, the jobs get discarded.

What is the expected behavior here? Try to submit the jobs anyways? Or treat it like a db rollback and discard the not-submitted jobs?

That remembers me of T85762: Get rid of the Uncommitted DB writes notice. If this notice is masking the real error, kill it with fire please

Seb35 added a subscriber: Seb35.May 31 2017, 9:09 AM

I’m not sure the root cause of this bug is the same for everybody reporting here, because there two very different cases: $wgJobRunRate = 0 and $wgJobRunRate > 0, and given the situation changed with the Gerrit change 212485.

In the case $wgJobRunRate > 0 only (*), the following explanation apply:

  1. some extensions (verified in Translate, but there are other extensions) do some operations, e.g. they modify pages (TranslateRenderJob in the case of Translate) with WikiPage::doEditContent(),
  2. when $wgJobRunRate > 0 the job is executed in MediaWiki’s post-connection shutdown (MediaWiki::restInPeace() and specifically in MediaWiki::triggerJobs()); at this time it is too late for lazy-pushing jobs with JobQueueGroup
  3. in the job, WikiPage::doEditContent() is called, it then call WikiPage::doModify() (or WikiPage::doCreate(), the same reasoning applies)
  4. WikiPage::doModify() update the recent changes by calling DeferredUpdates::addUpdate( new AtomicSectionUpdate() )
  5. Which is directly called since we are in post-connection shutdown, hence WikiPage::doEditUpdates() is called
  6. in some specific conditions (here $wgRCWatchCategoryMembership and some state conditions), WikiPage::doEditUpdates() calls JobQueueGroup::lazyPush, but it is too late and these jobs are never pushed in the job queue

I followed here a specific path, but there is probably a lot of other pathes. In summary it is a specific state depending of the activated extensions, the jobs these extensions are using, the parameter $wgJobRunRate > 0, and other parameters depending on the cases. Probably it would be a bad resolution path to check for every extension and every path which could potentially trigger this bug, but instead either:

  1. Check in JobQueueGroup::lazyPush() if DeferredUpdates::$immediateMode is true, and in this case JobQueueGroup::push() directly
  2. Add in MediaWiki::restInPeace() a JobQueueGroup::pushLazyJobs() *after* MediaWiki::triggerJobs() to push remaining lazy-pushed jobs
  3. both resolution paths

The reasoning here is very similar to T165714 (see also my comment T154425#3299505) where DeferredUpdates are directly executed but is triggering a transaction-in-transaction forbidden by LBFactory::beginMasterChanges(). In my humble opinion,

  1. DeferredUpdates should be always postponed even in MediaWiki::restInPeace() (as I did in https://gerrit.wikimedia.org/r/#/c/356120/ in the class DeferredUpdates)
  2. be executed after MediaWiki::triggerJobs() (as I did in https://gerrit.wikimedia.org/r/#/c/356120/ in the class MediaWiki) to be sure the environnement for the jobs is always the same regarding DeferredUpdates (never directly called)
  3. and one of the above resolution path for JobQueueGroup should be chosen; I tested the second resolution path (add a second JobQueueGroup::pushLazyJobs() after the second DeferredUpdates::doUpdates( 'enqueue' )) and there are no more JobQueueGroup::__destruct warnings (solving this bug), but I’m not sure if it will solve all situations, possibly one of the above resolution path should be done (instead or in addition).

(*) to wiki administrator reporting this bug here: please say the value of $wgJobRunRate, particularly if it is zero, and the exact list of extensions you are using.

All my wikis have $wgJobRunRate = 0. That probably explains why I have not been seeing this error message myself recently.

And TWN had previously $wgJobRunRate > 0 at the time the error message occured? (if you can remember, could it be correlated?)

It has been zero for a long long time, so I find it unlikely.

demon raised the priority of this task from High to Unbreak Now!.May 31 2017, 6:40 PM
Restricted Application added subscribers: Jay8g, TerraCodes. · View Herald TranscriptMay 31 2017, 6:40 PM

Change 356120 had a related patch set uploaded (by Seb35; owner: Seb35):
[mediawiki/core@master] Execute DeferredUpdates and pushLazyJobs after executing jobs

https://gerrit.wikimedia.org/r/356120

Change 356120 merged by jenkins-bot:
[mediawiki/core@master] Better handling of jobs execution in post-connection shutdown

https://gerrit.wikimedia.org/r/356120

Seb35 added a comment.Jun 2 2017, 8:20 AM

@ all: should be fixed in standard conditions for temperature and pressure in master version of MediaWiki. Could you test/verify if you still see this error? – I’m a bit afraid since Translatewiki has no Web-executed jobs, so no sure the patch will fix all cases.

Change 356799 had a related patch set uploaded (by Seb35; owner: Seb35):
[mediawiki/core@REL1_29] Better handling of jobs execution in post-connection shutdown

https://gerrit.wikimedia.org/r/356799

Seb35 added a comment.Jun 2 2017, 8:50 AM

Cherry-picked to REL1_29. I guess it should be backported to REL1_27 and REL1_28, but I’m not sure. Will check later, except if someone is quicker.

Cherry-picked to REL1_29. I guess it should be backported to REL1_27 and REL1_28, but I’m not sure. Will check later, except if someone is quicker.

Thanks a lot @Seb35 for your effort tackling this issue! I only use cron to work on my job queue so I am not the best tester for client triggerd executions. I believe this fix should be backported to REL1_27 and REL1_28.

Change 356799 merged by jenkins-bot:
[mediawiki/core@REL1_29] Better handling of jobs execution in post-connection shutdown

https://gerrit.wikimedia.org/r/356799

Change 356880 had a related patch set uploaded (by Seb35; owner: Seb35):
[mediawiki/core@REL1_28] Better handling of jobs execution in post-connection shutdown

https://gerrit.wikimedia.org/r/356880

Seb35 added a comment.Jun 2 2017, 7:11 PM

Backported to 1.28, which is very similar to 1.29 in this area -- I was able to reproduce the same behaviour with Translate.

However it’s more different for 1.27 since it seems to work: Translate jobs are executed through the Web interface. Hence the task T165714 seems to be non-existant on 1.27 but this task T100085 is still experienced -- in this, it was not a so great idea to solve both bugs with the same gerrit patch. I will prepare a specific gerrit patch for 1.27.

Change 356887 had a related patch set uploaded (by Seb35; owner: Seb35):
[mediawiki/core@REL1_27] Better handling of jobs execution in post-connection shutdown

https://gerrit.wikimedia.org/r/356887

Change 356880 merged by jenkins-bot:
[mediawiki/core@REL1_28] Better handling of jobs execution in post-connection shutdown

https://gerrit.wikimedia.org/r/356880

Change 356887 merged by jenkins-bot:
[mediawiki/core@REL1_27] Better handling of jobs execution in post-connection shutdown

https://gerrit.wikimedia.org/r/356887

Seb35 updated the task description. (Show Details)Jun 7 2017, 7:50 AM
Seb35 closed this task as Resolved.Jun 8 2017, 8:43 AM

Resolved in 1.27-1.30 (see summary in description for an overview of the two similar bugs).

Seb35 updated the task description. (Show Details)Jun 13 2017, 5:17 PM

@aaron I added some documentation on Manual:Job queue and Manual:Job queue/For developers; could you review it and particularly the differences between deferrable updates and jobs? It should help developers to decide if they want to use a deferrable update and/or a job.

Tuxxic removed a subscriber: Tuxxic.Jun 16 2017, 12:41 PM

@aaron I added some documentation on Manual:Job queue and Manual:Job queue/For developers; could you review it and particularly the differences between deferrable updates and jobs? It should help developers to decide if they want to use a deferrable update and/or a job.

LGTM.

Change 355476 merged by jenkins-bot:
[mediawiki/core@master] Push lazy jobs when exceptions are handled by MWExceptionHandler

https://gerrit.wikimedia.org/r/355476