
Could not enqueue jobs from stream mediawiki.job.cirrusSearchIncomingLinkCount
Closed, Duplicate · Public · PRODUCTION ERROR

Description

Error

MediaWiki version: 1.36.0-wmf.9

message
Could not enqueue jobs from stream mediawiki.job.cirrusSearchIncomingLinkCount

Impact

There were about 165 of these just now, in what looks like a two-minute spike. Not sure if this is worth rolling back the train for; would appreciate advice.

Notes

Details

Request ID
06cd76dd-35f6-4e68-ba49-719530975972
Request URL
https://jobrunner.discovery.wmnet/rpc/RunSingleJob.php
Stack Trace
exception.trace
#0 /srv/mediawiki/php-1.36.0-wmf.9/includes/jobqueue/JobQueue.php(374): MediaWiki\Extension\EventBus\Adapters\JobQueue\JobQueueEventBus->doBatchPush(array, integer)
#1 /srv/mediawiki/php-1.36.0-wmf.9/includes/jobqueue/JobQueue.php(344): JobQueue->batchPush(array, integer)
#2 /srv/mediawiki/php-1.36.0-wmf.9/includes/jobqueue/JobQueueGroup.php(171): JobQueue->push(array)
#3 /srv/mediawiki/php-1.36.0-wmf.9/extensions/CirrusSearch/includes/Job/LinksUpdate.php(76): JobQueueGroup->push(array)
#4 /srv/mediawiki/php-1.36.0-wmf.9/extensions/CirrusSearch/includes/Job/JobTraits.php(136): CirrusSearch\Job\LinksUpdate->doJob()
#5 /srv/mediawiki/php-1.36.0-wmf.9/extensions/EventBus/includes/JobExecutor.php(79): CirrusSearch\Job\CirrusTitleJob->run()
#6 /srv/mediawiki/rpc/RunSingleJob.php(76): MediaWiki\Extension\EventBus\JobExecutor->execute(array)
#7 {main}

Event Timeline

Restricted Application added a project: Discovery-Search. · Sep 17 2020, 1:40 PM
Restricted Application added a subscriber: Aklapper.
LarsWirzenius triaged this task as Unbreak Now! priority. · Sep 17 2020, 1:40 PM

I suspect that eventgate stopped accepting events for a while, so this is probably not related to this specific job but to all jobs that EventBus tried to push to eventgate at the time (searching for channel:EventBus NOT cirrusSearchIncomingLinkCount in logstash should confirm this).

Restricted Application added a project: Analytics. · Sep 17 2020, 2:15 PM
RLazarus lowered the priority of this task from Unbreak Now! to High. · Sep 17 2020, 3:42 PM
RLazarus added subscribers: Ottomata, RLazarus.

Yeah, this does look like it was related to the eventgate-main deployments at 13:20 and 13:33 -- traffic may not have shifted gracefully from the old instances to the new ones, resulting in a transient spike of errors.

Tagging in @Ottomata to comment, and also optimistically downgrading this from UBN. I would not plan to roll back the train.

mforns assigned this task to Ottomata. · Sep 17 2020, 3:42 PM
mforns added a project: Analytics-Kanban.
CDanis added a subscriber: CDanis. · Sep 17 2020, 3:44 PM

For posterity, logstash link: https://logstash.wikimedia.org/goto/f8c9aec62cbdb9dacf931493e056196c

I believe this was almost certainly https://sal.toolforge.org/log/MmJFnHQBv7KcG9M-PnFl

13:33	<otto@deploy1001>	helmfile [codfw] Ran 'sync' command on namespace 'eventgate-main' for release 'canary' .
13:33	<otto@deploy1001>	helmfile [codfw] Ran 'sync' command on namespace 'eventgate-main' for release 'production' .

There were some LVS changes made at the same time, but the impactful-to-traffic ones happened much earlier, at 13:19.

Not sure if something went wrong specifically with this Eventgate release, or if our usual deploy procedure happens to kill in-flight requests.
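
One way to check the in-flight-requests angle (a sketch only; the deployment name below is an assumption based on the eventgate-main namespace and the 'production' release in the SAL entries above) would be to see whether the pods get a termination grace period and a preStop drain hook:

# Illustrative sketch: inspect graceful-shutdown settings of the assumed
# eventgate-main-production Deployment in codfw. With no preStop hook,
# draining in-flight requests depends entirely on the app handling SIGTERM.
kube_env eventgate-main codfw
kubectl get deployment eventgate-main-production \
    -o jsonpath='{.spec.template.spec.terminationGracePeriodSeconds}{"\n"}'
kubectl get deployment eventgate-main-production \
    -o jsonpath='{.spec.template.spec.containers[*].lifecycle.preStop}{"\n"}'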

I did an eventgate-main deploy to apply the new helmfile layout. It should have been essentially a no-op, but there was a comment change in the helmfile diff, so new pods were deployed.

At the time, I do remember seeing some weird pod state, but I thought it was just k8s doing its thing:

13:36:04 [@deploy1001:/srv/ … /helmfile.d/services/eventgate-main] (master)[a92b8b6] ± kube_env eventgate-main codfw; kubectl get pods -o wide
NAME                                         READY   STATUS        RESTARTS   AGE     IP              NODE                         NOMINATED NODE
eventgate-main-canary-657d5cfd8c-2jwf4       0/3     Evicted       0          35d     <none>          kubernetes2001.codfw.wmnet   <none>
eventgate-main-canary-bd467dfd-fxsnx         3/3     Running       0          2m41s   10.192.66.108   kubernetes2007.codfw.wmnet   <none>
eventgate-main-production-58544b6b69-2b92l   0/3     Evicted       0          35d     <none>          kubernetes2009.codfw.wmnet   <none>
eventgate-main-production-58544b6b69-cv2bw   0/3     Evicted       0          35d     <none>          kubernetes2011.codfw.wmnet   <none>
eventgate-main-production-58544b6b69-trxxh   1/3     Terminating   1          3d4h    10.192.65.144   kubernetes2008.codfw.wmnet   <none>
eventgate-main-production-58544b6b69-vrjm6   0/3     Evicted       0          35d     <none>          kubernetes2008.codfw.wmnet   <none>
eventgate-main-production-58544b6b69-wb55f   0/3     Evicted       0          5d2h    <none>          kubernetes2007.codfw.wmnet   <none>
eventgate-main-production-779646cf8-224jj    3/3     Running       0          2m41s   10.192.67.82    kubernetes2011.codfw.wmnet   <none>
eventgate-main-production-779646cf8-j4vmv    3/3     Running       0          111s    10.192.66.175   kubernetes2010.codfw.wmnet   <none>
eventgate-main-production-779646cf8-l492w    3/3     Running       0          2m16s   10.192.66.30    kubernetes2013.codfw.wmnet   <none>
tiller-deploy-5867d8744d-qhhgp               0/1     Evicted       0          78d     <none>          kubernetes2008.codfw.wmnet   <none>
tiller-deploy-5867d8744d-wmrjm               1/1     Running       0          5d2h    10.192.66.70    kubernetes2007.codfw.wmnet   <none>
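
For what it's worth, a quick way to confirm whether the rollout itself completed cleanly (a sketch; the deployment names are assumed from the pod names in the listing above):

# Illustrative sketch: wait for the new ReplicaSets to finish rolling out and
# check recent namespace events around the evicted/terminating pods.
kube_env eventgate-main codfw
kubectl rollout status deployment/eventgate-main-canary --timeout=120s
kubectl rollout status deployment/eventgate-main-production --timeout=120s
kubectl get events --sort-by=.lastTimestamp | tail -n 20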
Ottomata added a subscriber: Pchelolo. · Edited · Sep 17 2020, 4:01 PM

Ok, I've been informed that Evicted is normal and not related. The MW error in logstash says:

upstream connect error or disconnect/reset before headers. reset reason: connection termination

I guess k8s killed the pod before allowing the request to finish? IIUC, EventBus should have retried the HTTP request (@Pchelolo, is that right?).
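
If it doesn't, the client-side mitigation would be to retry transient connection failures when POSTing to eventgate. Conceptually something like the curl sketch below (placeholder URL, port, and payload file; the real code path is EventBus's PHP HTTP client, not curl):

# Illustrative sketch of retrying the enqueue POST on transient failures.
# The URL, port and events.json payload are placeholders, not production config.
# --retry covers timeouts/5xx and --retry-connrefused covers refused connections;
# a mid-request reset like the one above may also need --retry-all-errors
# (curl >= 7.71).
curl --silent --show-error \
     --retry 3 --retry-delay 1 --retry-connrefused \
     --header 'Content-Type: application/json' \
     --data @events.json \
     https://eventgate-main.discovery.wmnet:4492/v1/events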

jeena added a subscriber: jeena. · Edited · Sep 18 2020, 10:30 PM

Various job enqueue errors happened over the past 6 hours today, with spikes of over 10k. I didn't see any k8s deployments correlated with them.

thcipriani raised the priority of this task from High to Unbreak Now!. · Sep 18 2020, 10:33 PM
thcipriani added a subscriber: thcipriani.

> Various job enqueue errors happened over the past 6 hours today, with spikes of over 10k. I didn't see any k8s deployments correlated with them.

There was a possibly unrelated alert in -operations around the time of one of the spikes:

PROBLEM - Ensure traffic_exporter binds on port 9322 and responds to HTTP requests on cp1083 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server

Not sure if all the jobqueue errors belong on this ticket or a separate one, but since the right folks are already here, I'm raising this to UBN!: there are a few waves of exceptions in the logs, and this could use some eyes from folks familiar with that system.

thcipriani lowered the priority of this task from Unbreak Now! to High. · Sep 19 2020, 12:00 AM

After a few rounds of spikes, the messages seem to have subsided; keeping an eye on the logs for the time being.

fdans moved this task from Incoming to Event Platform on the Analytics board. · Oct 5 2020, 4:40 PM