Page MenuHomePhabricator

Intermittent JobQueueError due to "Unable to deliver all events: 500: Internal Server Error"
Closed, ResolvedPublicPRODUCTION ERROR

Description

[1f1c01c5-5e68-42c7-ba67-eda110393b4a] 2025-02-11 20:11:00: Fatal exception of type "JobQueueError"

I've seen this several times when marking a page for translation. Some of them were misclicks that by serendipity happened to fail. Other times it worked when I retried

Second example:

[9c9f7319-9a91-4ffa-9f00-eeab9f568137] 2025-02-11 20:16:01: Fatal exception of type "JobQueueError"

Third example:

[5fec1826-52be-41da-a55c-177670d52dbd] 2025-02-11 20:50:27: Fatal exception of type "JobQueueError"

Fourth example:

[a4c8f38c-14b3-4e49-910e-2d122ef0c2fe] 2025-02-11 20:58:25: Fatal exception of type "JobQueueError"

Fifth example:

[5b1e2df8-0640-40b5-b55b-6cea428967cd] 2025-02-11 21:00:23: Fatal exception of type "JobQueueError"

From Logstash:

"context":{"message":"event abfd5c7b-2140-4672-b4c9-4b963f591d83 of schema at /mediawiki/job/1.0.0 destined to stream mediawiki.job.UpdateTranslatablePageJob is not allowed in stream; mediawiki.job.UpdateTranslatablePageJob is not configured."}}]
from /srv/mediawiki/php-1.44.0-wmf.16/extensions/EventBus/includes/Adapters/JobQueue/JobQueueEventBus.php(134)
#0 /srv/mediawiki/php-1.44.0-wmf.16/includes/jobqueue/JobQueue.php(382): MediaWiki\Extension\EventBus\Adapters\JobQueue\JobQueueEventBus->doBatchPush(array, int)
#1 /srv/mediawiki/php-1.44.0-wmf.16/includes/jobqueue/JobQueue.php(354): JobQueue->batchPush(array, int)
#2 /srv/mediawiki/php-1.44.0-wmf.16/includes/jobqueue/JobQueueGroup.php(159): JobQueue->push(array)
#3 /srv/mediawiki/php-1.44.0-wmf.16/extensions/Translate/src/PageTranslation/TranslatableBundleMover.php(213): JobQueueGroup->push(array)
#4 /srv/mediawiki/php-1.44.0-wmf.16/extensions/Translate/src/PageTranslation/MoveTranslatableBundleSpecialPage.php(151): MediaWiki\Extension\Translate\PageTranslation\TranslatableBundleMover->moveAsynchronously(MediaWiki\Title\Title, MediaWiki\Title\Title, bool, MediaWiki\User\User, string, bool, bool)
#5 /srv/mediawiki/php-1.44.0-wmf.16/includes/specialpage/SpecialPage.php(729): MediaWiki\Extension\Translate\PageTranslation\MoveTranslatableBundleSpecialPage->execute(string)
#6 /srv/mediawiki/php-1.44.0-wmf.16/includes/specialpage/SpecialPageFactory.php(1735): MediaWiki\SpecialPage\SpecialPage->run(string)
#7 /srv/mediawiki/php-1.44.0-wmf.16/includes/actions/ActionEntryPoint.php(503): MediaWiki\SpecialPage\SpecialPageFactory->executePath(string, MediaWiki\Context\RequestContext)
#8 /srv/mediawiki/php-1.44.0-wmf.16/includes/actions/ActionEntryPoint.php(145): MediaWiki\Actions\ActionEntryPoint->performRequest()
#9 /srv/mediawiki/php-1.44.0-wmf.16/includes/MediaWikiEntryPoint.php(202): MediaWiki\Actions\ActionEntryPoint->execute()
#10 /srv/mediawiki/php-1.44.0-wmf.16/index.php(58): MediaWiki\MediaWikiEntryPoint->run()
#11 /srv/mediawiki/w/index.php(3): require(string)
#12 {main}

Event Timeline

Pppery updated the task description. (Show Details)
Nikerabbit subscribed.
Could not enqueue jobs for stream mediawiki.job.UpdateTranslatablePageJob
"context":{"message":"event abfd5c7b-2140-4672-b4c9-4b963f591d83 of schema at /mediawiki/job/1.0.0 destined to stream mediawiki.job.UpdateTranslatablePageJob is not allowed in stream; mediawiki.job.UpdateTranslatablePageJob is not configured."}}]

I am not aware of any recent changes in Translate or mediawiki-config around this code.

I'm tempted to mark this as UBN! but used High assuming workaround is to try again. MWI is listed as maintainers of job queue, can you help to investigate?

Happening for me as well (when moving translatable pages). Moved ~10 pages and got the error for 3 of them. I looked through the logs and found this:

2025-02-13 20:22:40.956121 [7231ad34-0aee-4f35-aeef-c37b19d353e2] mw-web.codfw.main-7f7789979b-h5ccw metawiki 1.44.0-wmf.16 EventBus ERROR: Could not enqueue jobs for stream mediawiki.job.MoveTranslatableBundleJob {"stream":"mediawiki.job.MoveTranslatableBundleJob","invalidresponse":{"stdClass":["Unable to deliver all events: 500: Internal Server Error"]}}

For the record, a traceback for one of the errors:

from /srv/mediawiki/php-1.44.0-wmf.16/extensions/EventBus/includes/Adapters/JobQueue/JobQueueEventBus.php(134)
#0 /srv/mediawiki/php-1.44.0-wmf.16/includes/jobqueue/JobQueue.php(382): MediaWiki\Extension\EventBus\Adapters\JobQueue\JobQueueEventBus->doBatchPush(array, int)
#1 /srv/mediawiki/php-1.44.0-wmf.16/includes/jobqueue/JobQueue.php(354): JobQueue->batchPush(array, int)
#2 /srv/mediawiki/php-1.44.0-wmf.16/includes/jobqueue/JobQueueGroup.php(159): JobQueue->push(array)
#3 /srv/mediawiki/php-1.44.0-wmf.16/extensions/Translate/src/PageTranslation/TranslatableBundleMover.php(213): JobQueueGroup->push(array)
#4 /srv/mediawiki/php-1.44.0-wmf.16/extensions/Translate/src/PageTranslation/MoveTranslatableBundleSpecialPage.php(151): MediaWiki\Extension\Translate\PageTranslation\TranslatableBundleMover->moveAsynchronously(MediaWiki\Title\Title, MediaWiki\Title\Title, bool, MediaWiki\User\User, string, bool, bool)
#5 /srv/mediawiki/php-1.44.0-wmf.16/includes/specialpage/SpecialPage.php(729): MediaWiki\Extension\Translate\PageTranslation\MoveTranslatableBundleSpecialPage->execute(string)
#6 /srv/mediawiki/php-1.44.0-wmf.16/includes/specialpage/SpecialPageFactory.php(1735): MediaWiki\SpecialPage\SpecialPage->run(string)
#7 /srv/mediawiki/php-1.44.0-wmf.16/includes/actions/ActionEntryPoint.php(503): MediaWiki\SpecialPage\SpecialPageFactory->executePath(string, MediaWiki\Context\RequestContext)
#8 /srv/mediawiki/php-1.44.0-wmf.16/includes/actions/ActionEntryPoint.php(145): MediaWiki\Actions\ActionEntryPoint->performRequest()
#9 /srv/mediawiki/php-1.44.0-wmf.16/includes/MediaWikiEntryPoint.php(202): MediaWiki\Actions\ActionEntryPoint->execute()
#10 /srv/mediawiki/php-1.44.0-wmf.16/index.php(58): MediaWiki\MediaWikiEntryPoint->run()
#11 /srv/mediawiki/w/index.php(3): require(string)
#12 {main}

Looks like the scheduling endpoint is intermittently returning 500s (=an issue outside of MediaWiki).

Urbanecm renamed this task from Intermittent JobQueueError when marking for translation to Intermittent JobQueueError due to "Unable to deliver all events: 500: Internal Server Error".Feb 13 2025, 8:42 PM

Filled T386410: TranslatableBundleMover does not lift the lock if an asynchronous move failed to be schedule as a related issue (when moving a translatable page, it erroneously stays locked even though the job was not actually scheduled, and thus no move is happening).

It looks like eventgate has been experiencing a higher error rate for the last few days which might explain some failures to enqueue

Investigating.

@hnowlan help me understand the envoy_cluster_upstream_rq metric real quick. This is eventgate's local envoy proxy reporting that it is responding with 5xx, correct?

And, this graph shows the requesting side, correct? MW's envoy getting 5xx from eventgate-main?

Am I reading that correctly?

Strange that the eventgate-main service itself doesn't report this error!
https://grafana.wikimedia.org/goto/CKaiH7cHg?orgId=1

When eventgate reports "stream <stream_name> is not allowed in stream; <stream_name> is not configured", it means that the stream could not be found in stream config.

This shouldn't happen for mediawiki.job.* streams, because they are configured as a regex wildcard match. Any stream that is named like mediawiki.job.jobNameHere will return a stream config block, so they should never be 'not configured'.

e.g.

curl 'https://meta.wikimedia.org/w/api.php?action=streamconfigs&streams=mediawiki.job.cirrusSearchDeleteArchive' | jq .
# and
curl 'https://meta.wikimedia.org/w/api.php?action=streamconfigs&streams=mediawiki.job.madeUpNotRealJob' | jq .

Both work. This is how MediaWiki can produce job events with out pre-configuring the streams it will produce to. (I would like to change how this works one day...)

eventgate-main is configured to permanently cache in memory the stream configs it fetches from action=streamconfigs.

In {T326002#9262081} we simplified the behavior of the stream config caching. We implemented caching of regex matches in the eventgate service. EventStreamConfig API's regex stream feature is not being used by eventgate. eventgate is matching the requested stream name against the fetched configured regex, and then if it matches, caching the result locally.

Here is the code that is saving the matched stream config result in the local memory cache.

And here is the code that is throwing the error message we are seeing.

The result is not being cached (in this._streamConfigs) as no match is found.

eventgate-wikimedia has an endpoint to dump its cached stream configs (for debugging exactly this kind of thing). I can reproduce this.

curl 'https://eventgate-main.discovery.wmnet:4492/v1/stream-configs/mediawiki.job.MoveTranslatableBundleJob' | jq .

In eqiad, all pods respond with configs for mediawiki.job.MoveTranslatableBundleJob. In codfw, only 4 out of 6 pods do.

for ip in $(kubectl get pods -o wide -o json | jq -r .items[].status.podIPs[0].ip); do
    echo $ip
    curl -k "https://$ip:4492/v1/stream-configs/mediawiki.job.MoveTranslatableBundleJob"
    echo ""
    echo "---"
done

Somehow the stream config is not cached properly on some pods, but on others it is? I'm sure how this could happen, as a non-match should not be cached!

Investigating logs on an offending pod when making this request, I get:

Stream mediawiki.job.MoveTranslatableBundleJob is not present in stream configs loaded from http://127.0.0.1:6500/w/api.php?format=json&action=streamconfigs&constraints=destination_event_service=eventgate-main

This is a debug log, and is logged when the cached stream config does not have an key for the requested stream.

I can see that eventgate-main was restarted 9 days ago in eqiad, and 8 days ago in codfw.

I'm not sure what is going on here, but I think a reboot of eventgate-main in codfw will help.

Proceeding: https://wikitech.wikimedia.org/wiki/Event_Platform/EventGate/Administration#Roll_restart_all_pods

Mentioned in SAL (#wikimedia-operations) [2025-02-14T16:00:56Z] <ottomata> roll restart eventgate-main in codfw for T386138

Mentioned in SAL (#wikimedia-operations) [2025-02-14T16:04:32Z] <ottomata> roll restart eventgate-main in codfw for T386138 -- the previous command roll restarted in eqiad.

All pods now respond with stream config for mediawiki.job.MoveTranslatableBundleJob.

I'm at a loss for how this could happen, but I expect that (for now) these errors will disappear. Let's wait and see.

Shall we close this task then? I assume nobody is going to investigate how to prevent this from happening again.