Page MenuHomePhabricator

[16hrs] Investigate Infrequent JobQueueErrors on Wikidata
Closed, ResolvedPublic

Description

While editing Wikidata I noticed the bot crashing every once in a while with a job queue error:

WARNING: API error internal_api_error_JobQueueError: [X9g2iApAIDMAAIpUyLoAAABI] Caught exception of type JobQueueError
ERROR: Detected MediaWiki API exception internal_api_error_JobQueueError: [X9g2iApAIDMAAIpUyLoAAABI] Caught exception of type JobQueueError
[errorclass: JobQueueError;
 servedby: mw1339]; 

pywikibot.exceptions.OtherPageSaveError: Edit to page [[wikidata:Q104236239]] failed:
internal_api_error_JobQueueError: [X9wzpwpAAE0AAI7JHhwAAABT] Caught exception of type JobQueueError
[errorclass: JobQueueError;
 servedby: mw1282]

We want to understand what leads to the exceptions i.e. in what cases exceptions are thrown.
It is likely hard to reproduce locally. We might want to look what triggers the job and analyze the code which is involved.
The investigation is going to be timeboxed to 16 hours 15h

Event Timeline

[X9wzpwpAAE0AAI7JHhwAAABT] /w/api.php   JobQueueError from line 117 of /srv/mediawiki/php-1.36.0-wmf.22/extensions/EventBus/includes/Adapters/JobQueue/JobQueueEventBus.php: Could not enqueue jobs from stream mediawiki.job.constraintsRunCheck

Unable to deliver all events: 503: Service Unavailable

image.png (477×1 px, 73 KB)

https://logstash.wikimedia.org/goto/03d464dc89808855e456f57c2a73cf0d

Looks like there was quite a peak in December.
Seems to still be happening a bit
Happened a few thousand times in the last month.

We should investigate and figure out what the impact is and if there is anything that we can do better.

The issue seem to happen from different jobs - i.e. not specific to a particular behaviour.
First step: investigate when it actually happens so that we can think of ways of fixing it.

WMDE-leszek renamed this task from Infrequent JobQueueErrors on Wikidata to Investigate Infrequent JobQueueErrors on Wikidata.Jan 5 2021, 2:11 PM
WMDE-leszek updated the task description. (Show Details)
WMDE-leszek renamed this task from Investigate Infrequent JobQueueErrors on Wikidata to Investigate Infrequent JobQueueErrors on Wikidata [16hrs].Jan 5 2021, 2:16 PM
WMDE-leszek updated the task description. (Show Details)
noarave renamed this task from Investigate Infrequent JobQueueErrors on Wikidata [16hrs] to [16hrs] Investigate Infrequent JobQueueErrors on Wikidata.Jan 5 2021, 2:49 PM

One of my robots ran non-stop for the last week so looks like it's not happening at the moment. You got to love intermittent problems.....

Ha, right after I posted that my bot crashed twice. Now with internal API errors:

WARNING: API error internal_api_error_Error: [X-TMQQpAMMwAAHuo1NQAAAAE] Caught exception of type Error ERROR: Detected MediaWiki API exception internal_api_error_Error: [X-TMQQpAMMwAAHuo1NQAAAAE] Caught exception of type Error [servedby: mw1362; errorclass: Error]; raising

Just had it again:
pywikibot.data.api.APIMWException: internal_api_error_JobQueueError: [X-VEtQpAIDkAAHaGUqkAAADW] Caught exception of type JobQueueError
[servedby: mw1345;
errorclass: JobQueueError]

The first one looks like something else

[X-TMQQpAMMwAAHuo1NQAAAAE] /w/api.php?maxlag=6&action=wbgetentities&ids=Q104533556&format=json   Error from line 110 of /srv/mediawiki/php-1.36.0-wmf.22/extensions/Wikibase/lib/includes/Formatters/OutputFormatValueFormatterFactory.php: Class 'Wikibase\Lib\Formatters\EormatterLabelDescriptionLookupFactory' not found

Second one looks relevant to this ticket

X-VEtQpAIDkAAHaGUqkAAADW

Could not enqueue jobs from stream mediawiki.job.constraintsRunCheck

#0 /srv/mediawiki/php-1.36.0-wmf.22/includes/jobqueue/JobQueue.php(374): MediaWiki\Extension\EventBus\Adapters\JobQueue\JobQueueEventBus->doBatchPush(array, integer)
#1 /srv/mediawiki/php-1.36.0-wmf.22/includes/jobqueue/JobQueue.php(344): JobQueue->batchPush(array, integer)
#2 /srv/mediawiki/php-1.36.0-wmf.22/includes/jobqueue/JobQueueGroup.php(171): JobQueue->push(array)
#3 /srv/mediawiki/php-1.36.0-wmf.22/extensions/WikibaseQualityConstraints/src/WikibaseQualityConstraintsHooks.php(64): JobQueueGroup->push(array)
#4 /srv/mediawiki/php-1.36.0-wmf.22/includes/HookContainer/HookContainer.php(333): WikibaseQuality\ConstraintReport\WikibaseQualityConstraintsHooks::onWikibaseChange(Wikibase\Repo\Notifications\RepoItemChange)
#5 /srv/mediawiki/php-1.36.0-wmf.22/includes/HookContainer/HookContainer.php(140): MediaWiki\HookContainer\HookContainer->callLegacyHook(string, array, array, array)
#6 /srv/mediawiki/php-1.36.0-wmf.22/includes/Hooks.php(137): MediaWiki\HookContainer\HookContainer->run(string, array, array)
#7 /srv/mediawiki/php-1.36.0-wmf.22/extensions/Wikibase/repo/includes/Notifications/HookChangeTransmitter.php(37): Hooks::run(string, array)
#8 /srv/mediawiki/php-1.36.0-wmf.22/extensions/Wikibase/repo/includes/Notifications/ChangeNotifier.php(263): Wikibase\Repo\Notifications\HookChangeTransmitter->transmitChange(Wikibase\Repo\Notifications\RepoItemChange)
#9 /srv/mediawiki/php-1.36.0-wmf.22/extensions/Wikibase/repo/includes/Notifications/ChangeNotifier.php(203): Wikibase\Repo\Notifications\ChangeNotifier->transmitChange(Wikibase\Repo\Notifications\RepoItemChange)
#10 /srv/mediawiki/php-1.36.0-wmf.22/extensions/Wikibase/repo/includes/RepoHooks.php(256): Wikibase\Repo\Notifications\ChangeNotifier->notifyOnPageModified(MediaWiki\Revision\RevisionStoreRecord, MediaWiki\Revision\RevisionStoreRecord)
#11 /srv/mediawiki/php-1.36.0-wmf.22/includes/HookContainer/HookContainer.php(333): Wikibase\Repo\RepoHooks::onRevisionFromEditComplete(WikiPage, MediaWiki\Revision\RevisionStoreRecord, boolean, User, array)
#12 /srv/mediawiki/php-1.36.0-wmf.22/includes/HookContainer/HookContainer.php(140): MediaWiki\HookContainer\HookContainer->callLegacyHook(string, array, array, array)
#13 /srv/mediawiki/php-1.36.0-wmf.22/includes/HookContainer/HookRunner.php(3238): MediaWiki\HookContainer\HookContainer->run(string, array)
#14 /srv/mediawiki/php-1.36.0-wmf.22/includes/Storage/PageUpdater.php(1092): MediaWiki\HookContainer\HookRunner->onRevisionFromEditComplete(WikiPage, MediaWiki\Revision\RevisionStoreRecord, boolean, User, array)
#15 /srv/mediawiki/php-1.36.0-wmf.22/includes/Storage/PageUpdater.php(801): MediaWiki\Storage\PageUpdater->doModify(CommentStoreComment, User, integer)
#16 /srv/mediawiki/php-1.36.0-wmf.22/extensions/Wikibase/repo/includes/Store/Sql/WikiPageEntityStore.php(375): MediaWiki\Storage\PageUpdater->saveRevision(CommentStoreComment, integer)
#17 /srv/mediawiki/php-1.36.0-wmf.22/extensions/Wikibase/repo/includes/Store/Sql/WikiPageEntityStore.php(235): Wikibase\Repo\Store\Sql\WikiPageEntityStore->saveEntityContent(Wikibase\Repo\Content\ItemContent, User, string, integer, integer, array)
#18 /srv/mediawiki/php-1.36.0-wmf.22/extensions/Wikibase/lib/includes/Store/TypeDispatchingEntityStore.php(90): Wikibase\Repo\Store\Sql\WikiPageEntityStore->saveEntity(Wikibase\DataModel\Entity\Item, string, User, integer, integer, array)
#19 /srv/mediawiki/php-1.36.0-wmf.22/extensions/Wikibase/repo/includes/EditEntity/StatsdSaveTimeRecordingEntityStore.php(53): Wikibase\Lib\Store\TypeDispatchingEntityStore->saveEntity(Wikibase\DataModel\Entity\Item, string, User, integer, integer, array)
#20 /srv/mediawiki/php-1.36.0-wmf.22/extensions/Wikibase/repo/includes/EditEntity/MediawikiEditEntity.php(741): Wikibase\Repo\EditEntity\StatsdSaveTimeRecordingEntityStore->saveEntity(Wikibase\DataModel\Entity\Item, string, User, integer, integer, array)
#21 /srv/mediawiki/php-1.36.0-wmf.22/extensions/Wikibase/repo/includes/EditEntity/StatsdSaveTimeRecordingEditEntity.php(77): Wikibase\Repo\EditEntity\MediawikiEditEntity->attemptSave(Wikibase\DataModel\Entity\Item, string, integer, string, boolean, array)
#22 /srv/mediawiki/php-1.36.0-wmf.22/extensions/Wikibase/repo/includes/Api/EntitySavingHelper.php(370): Wikibase\Repo\EditEntity\StatsdSaveTimeRecordingEditEntity->attemptSave(Wikibase\DataModel\Entity\Item, string, integer, string, NULL, array)
#23 /srv/mediawiki/php-1.36.0-wmf.22/extensions/Wikibase/repo/includes/Api/SetQualifier.php(152): Wikibase\Repo\Api\EntitySavingHelper->attemptSaveEntity(Wikibase\DataModel\Entity\Item, string)
#24 /srv/mediawiki/php-1.36.0-wmf.22/includes/api/ApiMain.php(1607): Wikibase\Repo\Api\SetQualifier->execute()
#25 /srv/mediawiki/php-1.36.0-wmf.22/includes/api/ApiMain.php(587): ApiMain->executeAction()
#26 /srv/mediawiki/php-1.36.0-wmf.22/includes/api/ApiMain.php(558): ApiMain->executeActionWithErrorHandling()
#27 /srv/mediawiki/php-1.36.0-wmf.22/api.php(90): ApiMain->execute()
#28 /srv/mediawiki/php-1.36.0-wmf.22/api.php(45): wfApiMain()
#29 /srv/mediawiki/w/api.php(3): require(string)
#30 {main}

That latest stack trace looks like the job is being queued as part of the main save processing.
And if that queue fails we let the error bubble up to the user.
We should probably defer queueing the job until after we have already responded to the user.

Yeah, I think that’s all we can do here… enqueueing jobs seems to occasionally fail, at random as far as I can tell; we should consider when pushing the job whether a failure to push it would be acceptable or not. If it’s not acceptable, then letting the error bubble to the user seems like the right thing to do (maybe with a better error message, but still rolling back the database transaction); if it is acceptable, then switching from JobQueueGroup::push() to lazyPush() would probably work to hide the error from the user. The constraints job in T270522#6758084 probably belongs in the latter category.

As for the most common job failures in the past 7 days:

  • wikibase-addUsagesForPage: whole hook handler (DataUpdateHookHandler::onParserCacheSaveComplete()) is already done in a deferred update (default stage, i.e. postsend); should be fine as-is
  • CleanTermsIfUnused: scheduled (by DatabaseTermStoreWriterBase::submitJobToCleanTermStorageRowsIfUnused()) as onTransactionCommitOrIdle() callback; should be fine as-is
  • constraintsRunCheck: directly in WikibaseQualityConstraintsHooks::onWikibaseChange(); should probably use lazyPush()
  • ChangeNotification: lazyPush() in JobQueueChangeNotificationSender::sendNotification(), called by dispatchChanges.php maintenance script (therefore not deferred); should be fine as-is
  • many others: not Wikibase

Change 657141 had a related patch set uploaded (by Lucas Werkmeister (WMDE); owner: Lucas Werkmeister (WMDE)):
[mediawiki/extensions/WikibaseQualityConstraints@master] Use lazyPush() for CheckConstraintsJob

https://gerrit.wikimedia.org/r/657141

Change 657141 merged by jenkins-bot:
[mediawiki/extensions/WikibaseQualityConstraints@master] Use lazyPush() for CheckConstraintsJob

https://gerrit.wikimedia.org/r/657141

Once that change is deployed (next week), I expect we’ll still see roughly the same number of job queue errors in logstash, but they should no longer lead to request failures so often.

I haven't specifically verified anything here, but I saw the changes and agree with what is said by Lucas above.