revalidateLinkRecommendations.php fails periodically with JobQueueError: Could not enqueue jobs
Closed, Resolved · Public · PRODUCTION ERROR

Description

As part of T371597: Add a Link (Structured task): Release as "turned off" to German Wikipedia and T370802: Add a link (Structured task): Release as "turned off" to English Wikipedia, we needed to revalidate all link recommendations at enwiki and dewiki. This can be done by running revalidateLinkRecommendations.php. Unfortunately, as of August 2024, this script periodically fails with the following:

  Lotteriemonopol is outdated, regenerating... success                                                                                                                                                                                             
  Leonidas Proaño is outdated, regenerating... JobQueueError from line 134 of /srv/mediawiki/php-1.43.0-wmf.16/extensions/EventBus/includes/Adapters/JobQueue/JobQueueEventBus.php: Could not enqueue jobs                                         
#0 /srv/mediawiki/php-1.43.0-wmf.16/includes/jobqueue/JobQueue.php(380): MediaWiki\Extension\EventBus\Adapters\JobQueue\JobQueueEventBus->doBatchPush(Array, 0)                                                                                    
#1 /srv/mediawiki/php-1.43.0-wmf.16/includes/jobqueue/JobQueue.php(352): JobQueue->batchPush(Array, 0)                                                                                                                                             
#2 /srv/mediawiki/php-1.43.0-wmf.16/includes/jobqueue/JobQueueGroup.php(157): JobQueue->push(Array)                                                                                                                                                
#3 /srv/mediawiki/php-1.43.0-wmf.16/extensions/CirrusSearch/includes/Updater.php(476): JobQueueGroup->push(Array)                                                                                                                                  
#4 /srv/mediawiki/php-1.43.0-wmf.16/extensions/CirrusSearch/includes/Updater.php(292): CirrusSearch\Updater->pushElasticaWriteJobs('weighted_tags', Array, Object(Closure))                                                                        
#5 /srv/mediawiki/php-1.43.0-wmf.16/extensions/CirrusSearch/includes/CirrusSearch.php(666): CirrusSearch\Updater->resetWeightedTags(Object(MediaWiki\Page\PageIdentityValue), 'weighted_tags', 'recommendation....')                               
#6 /srv/mediawiki/php-1.43.0-wmf.16/extensions/GrowthExperiments/includes/NewcomerTasks/AddLink/LinkRecommendationHelper.php(88): CirrusSearch\CirrusSearch->resetWeightedTags(Object(MediaWiki\Page\PageIdentityValue), 'recommendation....')     
#7 /srv/mediawiki/php-1.43.0-wmf.16/extensions/GrowthExperiments/maintenance/revalidateLinkRecommendations.php(205): GrowthExperiments\NewcomerTasks\AddLink\LinkRecommendationHelper->deleteLinkRecommendation(Object(MediaWiki\Page\PageIdentityValue), true)
#8 /srv/mediawiki/php-1.43.0-wmf.16/extensions/GrowthExperiments/maintenance/revalidateLinkRecommendations.php(119): GrowthExperiments\Maintenance\RevalidateLinkRecommendations->regenerateRecommendation(Object(GrowthExperiments\NewcomerTasks\AddLink\LinkRecommendation))
#9 /srv/mediawiki/php-1.43.0-wmf.16/maintenance/includes/MaintenanceRunner.php(696): GrowthExperiments\Maintenance\RevalidateLinkRecommendations->execute()                                                                                       
#10 /srv/mediawiki/php-1.43.0-wmf.16/maintenance/run.php(51): MediaWiki\Maintenance\MaintenanceRunner->run()
#11 /srv/mediawiki/multiversion/MWScript.php(158): require_once('/srv/mediawiki/...')
#12 {main}

Restarting the script makes it resume from where it left off, but someone has to monitor the run closely and restart it whenever it fails (alternatively, a bash script could probably do the same). We should look into why the script intermittently fails to enqueue jobs.
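As a rough illustration of the "wrapper that restarts the script" idea, here is a minimal Python sketch. It assumes (this is not confirmed in the task) that the maintenance script exits with a non-zero status when the JobQueueError is thrown, and it relies on the behaviour described above that a fresh run resumes where the previous one stopped. The function name and its parameters are hypothetical.

```python
import subprocess
import time
from typing import Callable, Sequence


def run_until_success(
    command: Sequence[str],
    max_restarts: int = 20,
    pause_seconds: float = 30.0,
    run: Callable[[Sequence[str]], int] = lambda cmd: subprocess.run(cmd).returncode,
    sleep: Callable[[float], None] = time.sleep,
) -> int:
    """Re-run `command` until it exits with status 0.

    Returns how many restarts were needed. Raises RuntimeError if the
    command is still failing after `max_restarts` restarts.
    Assumption: a failed run exits non-zero and a new run resumes the work.
    """
    for restarts in range(max_restarts + 1):
        if run(command) == 0:
            return restarts
        if restarts < max_restarts:
            # Give a transient enqueue problem some time to clear.
            sleep(pause_seconds)
    raise RuntimeError(f"{command!r} still failing after {max_restarts} restarts")
```

The `command` list would be whatever invocation is normally used to start revalidateLinkRecommendations.php. The cap on restarts keeps a persistent failure from looping forever.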

So far, this looks to be a transient error that just happens occasionally. It might be a good idea to adjust CirrusSearch and/or the MediaWiki-Core-JobQueue system to retry a couple of times when a JobQueueError happens.
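The "retry a couple of times" idea could look roughly like the sketch below. It is written in Python for brevity rather than as a patch to CirrusSearch or the job queue, and `EnqueueError` and `push_with_retry` are hypothetical names standing in for JobQueueError and for whatever call actually pushes the jobs.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


class EnqueueError(Exception):
    """Hypothetical stand-in for MediaWiki's JobQueueError."""


def push_with_retry(
    push: Callable[[], T],
    attempts: int = 3,
    base_delay: float = 0.5,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call `push`, retrying on EnqueueError with exponential backoff."""
    for attempt in range(attempts):
        try:
            return push()
        except EnqueueError:
            if attempt == attempts - 1:
                raise  # out of attempts: surface the original error
            sleep(base_delay * 2 ** attempt)
    raise ValueError("attempts must be at least 1")
```

A retry only helps if the failure is transient from the caller's point of view, for example when a later request reaches a different backend instance.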

Event Timeline

Restricted Application added a subscriber: Aklapper.
Urbanecm_WMF changed the subtype of this task from "Task" to "Production Error". · Aug 3 2024, 10:35 PM

Tagging CirrusSearch, as the exception is occurring in that extension.

T249745: Could not enqueue jobs: "Unable to deliver all events: 503: Service Unavailable" looks possibly relevant, but I have revalidated all suggestions in the past, and it did not really end in fatals (or at least, not as often as it does now). So... did something change here?

dr0ptp4kt claimed this task.
dr0ptp4kt subscribed.
This comment was removed by dr0ptp4kt.

Sorry, wrong ticket, reopening.

EBernhardson added a project: Event-Platform.
EBernhardson subscribed.

While the stack trace indicates this happens through CirrusSearch, I'm dubious that any fix should be applied there. If job submission is intermittently failing (and a quick check of Logstash shows 51 failures to enqueue jobs in the last 24 hours, for various jobs, not CirrusSearch-specific), then the fix should be applied to job submission directly.

The following is a sampling of the error messages EventBus is receiving from the remote endpoint.

"event d72006d8-22b2-4d2f-9c53-40f81c52e033 of schema at /mediawiki/job/1.0.0 destined to stream mediawiki.job.MessageGroupStatesUpdaterJob is not allowed in stream; mediawiki.job.MessageGroupStatesUpdaterJob is not configured."
"event f97f1978-efaf-4a88-bdd7-ec5f13b09020 of schema at /mediawiki/job/1.0.0 destined to stream mediawiki.job.cirrusSearchLinksUpdate is not allowed in stream; mediawiki.job.cirrusSearchLinksUpdate is not configured."
"event 4cb90a99-77db-4677-b48e-bd1929970edc of schema at /mediawiki/job/1.0.0 destined to stream mediawiki.job.activityUpdateJob is not allowed in stream; mediawiki.job.activityUpdateJob is not configured."
"event f7ac60d4-f33d-4861-800b-437962853ef8 of schema at /mediawiki/job/1.0.0 destined to stream mediawiki.job.activityUpdateJob is not allowed in stream; mediawiki.job.activityUpdateJob is not configured."
"event 1863b49e-71e6-4fad-acad-4fd0f6bd25dd of schema at /mediawiki/job/1.0.0 destined to stream mediawiki.job.webVideoTranscodePrioritized is not allowed in stream; mediawiki.job.webVideoTranscodePrioritized is not configured."
"event 91d8d9c6-5c23-44da-8384-a382d4ffe2ab of schema at /mediawiki/job/1.0.0 destined to stream mediawiki.job.activityUpdateJob is not allowed in stream; mediawiki.job.activityUpdateJob is not configured."
"event e60c68a0-e7cb-49d5-99b1-3a1089b22aca of schema at /mediawiki/job/1.0.0 destined to stream mediawiki.job.userOptionsUpdate is not allowed in stream; mediawiki.job.userOptionsUpdate is not configured."
"event 364ee416-129e-49c0-9609-1e77f7685187 of schema at /mediawiki/job/1.0.0 destined to stream mediawiki.job.activityUpdateJob is not allowed in stream; mediawiki.job.activityUpdateJob is not configured."
"event 2fd2b82d-6601-41ec-920d-10ff2140f625 of schema at /mediawiki/job/1.0.0 destined to stream mediawiki.job.activityUpdateJob is not allowed in stream; mediawiki.job.activityUpdateJob is not configured."
"event 0e3e37ae-eeae-4808-9b7f-134d22f0ab07 of schema at /mediawiki/job/1.0.0 destined to stream mediawiki.job.setUserMentorDatabaseJob is not allowed in stream; mediawiki.job.setUserMentorDatabaseJob is not configured."
"event 557c6ced-ed4b-47fc-929b-eeec9d8c123f of schema at /mediawiki/job/1.0.0 destined to stream mediawiki.job.setUserMentorDatabaseJob is not allowed in stream; mediawiki.job.setUserMentorDatabaseJob is not configured."
"event 17d1e8f8-5921-44fa-b669-5a9df80825f2 of schema at /mediawiki/job/1.0.0 destined to stream mediawiki.job.UpdateRepoOnMove is not allowed in stream; mediawiki.job.UpdateRepoOnMove is not configured."

It looks like there is some problem with the destination service not always loading the schemas properly?

This error message happens when eventgate-wikimedia succeeds in fetching stream configs from MW API action=streamconfigs, but the requested stream name is not defined.

mediawiki.job.* streams are a special case. These are the only streams that use a regex to match stream names.

eventgate-main is configured to permanently cache fetched stream configs. When a mediawiki.job.xxxxx stream matches the regex, it is cached as its full stream name so the regex doesn't have to be matched again.
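To make the described lookup concrete, here is a small Python sketch of "match the regex once, then cache under the full stream name". It is an illustration only, not eventgate's actual code. The convention that regex keys are wrapped in slashes, and the class and attribute names, are assumptions made for this example.

```python
import re
from typing import Dict, List, Optional, Pattern, Tuple


class StreamConfigCache:
    """Resolve stream names against literal and regex config keys."""

    def __init__(self, configs: Dict[str, dict]) -> None:
        # Assumption: keys wrapped in slashes are regexes, all others literal.
        self._literal: Dict[str, dict] = {}
        self._patterns: List[Tuple[Pattern[str], dict]] = []
        for key, settings in configs.items():
            if len(key) > 1 and key.startswith("/") and key.endswith("/"):
                self._patterns.append((re.compile(key[1:-1]), settings))
            else:
                self._literal[key] = settings
        self.regex_checks = 0  # counts regex evaluations, for illustration

    def get(self, stream: str) -> Optional[dict]:
        if stream in self._literal:
            return self._literal[stream]
        for pattern, settings in self._patterns:
            self.regex_checks += 1
            if pattern.search(stream):
                # Cache under the full stream name so that later lookups
                # skip the regex entirely.
                self._literal[stream] = settings
                return settings
        return None  # the stream "is not configured"
```

With a single regex key for the job streams, the first lookup of a given mediawiki.job.xxxxx name evaluates the regex, and every later lookup of that same name is a plain dictionary hit.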

@EBernhardson

  • Did these errors happen in a short time frame?
  • Does this only happen for mediawiki.job.* streams?

It is possible there is some subtle bug in the stream config fetching & regex matching code, but if the error is intermittent and goes away without an eventgate-main restart, I'd expect the problem to be with the MW action API response.
FWIW this code was last modified in Fall of 2023: T326002#9263206
See also https://wikitech.wikimedia.org/wiki/Data_Platform/Data_Lake/Data_Issues/2023-11_eventgate-analytics-external_Data_Loss

> Did these errors happen in a short time frame?

They are spread out through the day. The messages above all came from https://logstash.wikimedia.org/goto/35f7c7c25630aa6314535e6fd6d2cbd5
Basically, I exported the backend response and parsed it with jq -r '.hits.hits[]._source.service_response."3"' | jq '.error[].context.message'
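For anyone who prefers Python over jq, the same extraction can be sketched as follows. The field path is taken directly from the jq filters above. Everything else is an assumption made for illustration: the function names, the tolerance for `service_response."3"` being either a JSON string or an already-decoded object, and the tally by stream name.

```python
import json
from collections import Counter
from typing import Iterable, Iterator


def error_messages(export: dict) -> Iterator[str]:
    """Yield each error message found in an exported Logstash response."""
    for hit in export["hits"]["hits"]:
        response = hit["_source"]["service_response"]["3"]
        if isinstance(response, str):
            # The two-stage jq pipeline suggests this field holds JSON text.
            response = json.loads(response)
        for error in response["error"]:
            yield error["context"]["message"]


def count_by_stream(messages: Iterable[str]) -> Counter:
    """Tally messages by the stream name that follows 'destined to stream'."""
    marker = "destined to stream "
    counts: Counter = Counter()
    for message in messages:
        if marker in message:
            stream = message.split(marker, 1)[1].split(" ", 1)[0]
            counts[stream] += 1
    return counts
```

Loading the export with `json.load` and passing the result through `count_by_stream(error_messages(export))` gives a per-stream count, which makes it easier to see whether the failures are limited to mediawiki.job.* streams.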

> Does this only happen for mediawiki.job.* streams?

I only looked at the last 24 hours, but within that time frame, yes.

> if the error is intermittent and goes away without an eventgate-main restart, I'd expect the problem to be with the MW action API response.

I tried poking around for logs that would indicate this, but came up short. The only particularly definitive logs I'm finding are the ones where MediaWiki logs the 500 response when submitting the job.

Perhaps one notable bit: it looks like these error messages started on July 29th at 20:48. That is at about the same time that I rolled a restart on eventgate-main to support the private wiki streams.

> Perhaps one notable bit: it looks like these error messages started on July 29th at 20:48. That is at about the same time that I rolled a restart on eventgate-main to support the private wiki streams.

As a bit of a random guess, I rolled another restart on eventgate-main, but it doesn't seem to have changed anything.

Untagging Search Platform so that this sits with Growth and the folks working on EventGate (Data Engineering). Just a heads up.

> I rolled another restart on eventgate-main, but it doesn't seem to have changed anything.

Did you only do eqiad? I think codfw is currently active DC, and eventgate-main pods have been running for 13 days.

Interestingly, it looks like this error is only happening on a single eventgate-main pod!

https://logstash.wikimedia.org/goto/6b1ceb109c202c289379c7481a5507ab

I'm going to kill and recreate the one pod and see what happens.

Actually, just going to do a rolling restart.

Mentioned in SAL (#wikimedia-operations) [2024-08-12T19:46:45Z] <ottomata> rolling restart of eventgate-main in codfw - T371767

> I think codfw is currently active DC, and eventgate-main pods have been running for 13 days.

Uh, no, eqiad is active DC, sorry.

But, the pod that was erroring was in codfw. I suppose MW can still submit jobs in codfw?

Interesting! Poking at my bash history, it looks like I rolled the restart in both DCs the first time around (when deploying private wiki streams), but for the restart for this ticket I forgot about codfw and only ran it in eqiad. In terms of job submission, MediaWiki can certainly issue jobs from both datacenters, although I would expect the large majority of jobs to come from the active DC as part of write requests.

> would expect the large majority of jobs to come from the active DC as part of write requests.

Makes sense! This would also be why the errors are so intermittent. Inactive DC + only one pod.

No errors since I restarted eventgate-main in codfw.

So why did this happen? As a guess: action=streamconfigs?streams=mediawiki.job.revalidateLinkRecommendations returned some bad response and it got permanently cached in eventgate.

We should follow up and perhaps not cache if the stream is not configured?
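The guessed failure mode and the proposed follow-up can be illustrated with a small Python sketch. It assumes, as the guess above does, that a "not configured" answer was cached permanently. Whether eventgate really behaves this way was not confirmed in this task, and all names below are hypothetical.

```python
from typing import Callable, Dict, Optional

Fetch = Callable[[str], Optional[dict]]


class StreamConfigResolver:
    """Look up stream configs through `fetch`, caching results forever."""

    def __init__(self, fetch: Fetch, cache_misses: bool) -> None:
        self._fetch = fetch
        self._cache_misses = cache_misses
        self._cache: Dict[str, Optional[dict]] = {}

    def resolve(self, stream: str) -> Optional[dict]:
        if stream in self._cache:
            return self._cache[stream]
        config = self._fetch(stream)
        # Only remember "not configured" (None) when explicitly asked to.
        if config is not None or self._cache_misses:
            self._cache[stream] = config
        return config


def fetch_with_one_bad_response() -> Fetch:
    """Build a fake fetcher whose first answer is wrongly 'not configured'."""
    calls = 0

    def fetch(stream: str) -> Optional[dict]:
        nonlocal calls
        calls += 1
        return None if calls == 1 else {"stream": stream}

    return fetch
```

With `cache_misses=True`, one bad response makes the stream look unconfigured until the process restarts, which matches the symptom of a single pod rejecting events until it was restarted. With `cache_misses=False`, the next lookup fetches again and recovers, at the cost of one extra fetch per event for streams that are genuinely not configured.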

Ottomata claimed this task.

Seems fine for now, there may be a deeper bug to discover if this happens again.

Please reopen if needed.