Maniphest T307862

Search on betacommons is not indexing anything
Closed, Resolved · Public · 5 Estimated Story Points · BUG REPORT

Description

List of steps to reproduce (step by step, including full links if applicable):

What happens?:

Returns only Jason Shaw - Ecstasy X at 8kbps.mp3
Returns nothing
Returns nothing

What should have happened instead?:

Should have also returned the more recently uploaded Jason Shaw - Get A Move On.mp3
Return FRC4K30.webm
Return FRC4K30.webm

If this is another JobQueue thingie this and T306758 and T307173 might have the same root cause.

Details

	Subject	Repo	Branch	Lines +/-
	[Beta Cluster] LabsServices: Move eventgate to new hosts	operations/mediawiki-config	master	+3 -3
	changeprop: Update beta cluster domain names to .cloud	operations/deployment-charts	master	+7 -7


Related Objects

Mentioned In: T306758: Media files on betacommons are not transcoding
Mentioned Here: T302699: Beta cluster down: Error: 502, Next Hop Connection Failed (Feb 2022)
T306758: Media files on betacommons are not transcoding
T307173: Edits are incorrectly marked as "updated since your last visit" in page history on betacommons

Event Timeline

AlexisJazz created this task. May 8 2022, 10:50 AM

Restricted Application added a project: Discovery-Search. May 8 2022, 10:50 AM

Restricted Application added a subscriber: Aklapper.

AlexisJazz updated the task description. May 8 2022, 10:53 AM

AlexisJazz added projects: Beta-Cluster-Infrastructure, Beta-Cluster-reproducible. May 8 2022, 11:10 AM

MPhamWMF moved this task from needs triage to Current work on the Discovery-Search board. May 9 2022, 3:28 PM

MPhamWMF edited projects, added Discovery-Search (Current work); removed Discovery-Search.

AlexisJazz updated the task description. May 9 2022, 3:45 PM

MPhamWMF set the point value for this task to 5. May 9 2022, 3:58 PM

MPhamWMF moved this task from Incoming to Ready for Dev -- SWE on the Discovery-Search (Current work) board.

CirrusSearch logs in beta cluster are fairly quiet. Since the job queue was mentioned, I looked into cpjobqueue, the process that distributes jobs to workers. Horizon suggests deployment-docker-cpjobqueue01 is the only host running cpjobqueue. The current logs show the instance is stuck in a crash loop:

ebernhardson@deployment-docker-cpjobqueue01:~$ sudo docker logs --tail 5 c456ed2a0e52 
{"name":"change-propagation","hostname":"c456ed2a0e52","pid":1,"level":"ERROR","message":"worker died, restarting","worker_pid":16900,"exit_code":1,"levelPath":"error/service-runner/master","msg":"worker died, restarting","time":"2022-05-09T18:29:49.800Z","v":0}
{"name":"changeprop","hostname":"c456ed2a0e52","pid":17040,"level":"FATAL","err":{"message":"broker transport failure","name":"Error","stack":"Error: Local: Broker transport failure\n    at Function.createLibrdkafkaError [as create] (/srv/service/node_modules/node-rdkafka/lib/error.js:334:10)\n    at /srv/service/node_modules/node-rdkafka/lib/client.js:339:28","code":-195,"origin":"local","errno":-195},"stack":"Error: Local: Broker transport failure\n    at Function.createLibrdkafkaError [as create] (/srv/service/node_modules/node-rdkafka/lib/error.js:334:10)\n    at /srv/service/node_modules/node-rdkafka/lib/client.js:339:28","levelPath":"fatal/startup","msg":"Message not supplied","time":"2022-05-09T18:30:22.622Z","v":0}
{"name":"change-propagation","hostname":"c456ed2a0e52","pid":1,"level":"ERROR","message":"worker died, restarting","worker_pid":17040,"exit_code":1,"levelPath":"error/service-runner/master","msg":"worker died, restarting","time":"2022-05-09T18:30:24.658Z","v":0}
{"name":"changeprop","hostname":"c456ed2a0e52","pid":17180,"level":"FATAL","err":{"message":"broker transport failure","name":"Error","stack":"Error: Local: Broker transport failure\n    at Function.createLibrdkafkaError [as create] (/srv/service/node_modules/node-rdkafka/lib/error.js:334:10)\n    at /srv/service/node_modules/node-rdkafka/lib/client.js:339:28","code":-195,"origin":"local","errno":-195},"stack":"Error: Local: Broker transport failure\n    at Function.createLibrdkafkaError [as create] (/srv/service/node_modules/node-rdkafka/lib/error.js:334:10)\n    at /srv/service/node_modules/node-rdkafka/lib/client.js:339:28","levelPath":"fatal/startup","msg":"Message not supplied","time":"2022-05-09T18:30:57.508Z","v":0}
{"name":"change-propagation","hostname":"c456ed2a0e52","pid":1,"level":"ERROR","message":"worker died, restarting","worker_pid":17180,"exit_code":1,"levelPath":"error/service-runner/master","msg":"worker died, restarting","time":"2022-05-09T18:30:59.538Z","v":0}
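For anyone wanting to confirm a crash loop like this without reading the JSON by eye, a minimal Python sketch follows. It assumes the log lines are one JSON object per line, as in the output above. The function name `summarize_crash_loop` and the trimmed `SAMPLE_LOG` are illustrative only (most fields from the real log entries are dropped), not part of any existing tooling.

```python
import json
from datetime import datetime

# Trimmed copies of the log entries above; most fields are omitted for brevity.
SAMPLE_LOG = "\n".join([
    '{"name":"change-propagation","level":"ERROR","msg":"worker died, restarting","time":"2022-05-09T18:29:49.800Z"}',
    '{"name":"changeprop","level":"FATAL","err":{"message":"broker transport failure","code":-195},"msg":"Message not supplied","time":"2022-05-09T18:30:22.622Z"}',
    '{"name":"change-propagation","level":"ERROR","msg":"worker died, restarting","time":"2022-05-09T18:30:24.658Z"}',
    '{"name":"changeprop","level":"FATAL","err":{"message":"broker transport failure","code":-195},"msg":"Message not supplied","time":"2022-05-09T18:30:57.508Z"}',
    '{"name":"change-propagation","level":"ERROR","msg":"worker died, restarting","time":"2022-05-09T18:30:59.538Z"}',
])

TIME_FORMAT = "%Y-%m-%dT%H:%M:%S.%fZ"


def summarize_crash_loop(log_text):
    """Count worker restarts and collect fatal error messages from JSON log lines."""
    restarts = []
    fatal_messages = set()
    for line in log_text.splitlines():
        line = line.strip()
        if not line:
            continue
        try:
            entry = json.loads(line)
        except json.JSONDecodeError:
            continue  # skip anything that is not a JSON log entry
        if entry.get("msg") == "worker died, restarting":
            restarts.append(datetime.strptime(entry["time"], TIME_FORMAT))
        elif entry.get("level") == "FATAL":
            fatal_messages.add(entry.get("err", {}).get("message", "unknown"))
    # Seconds between consecutive restarts; a steady short gap suggests a loop.
    gaps = [(b - a).total_seconds() for a, b in zip(restarts, restarts[1:])]
    return {
        "restarts": len(restarts),
        "fatal_messages": sorted(fatal_messages),
        "gaps_seconds": gaps,
    }


if __name__ == "__main__":
    print(summarize_crash_loop(SAMPLE_LOG))
```

Repeated restarts at a near-constant interval, all preceded by the same fatal error, are what point at a startup failure rather than an occasional worker crash.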

The relevant configuration appears to be in a docker volume, and it refers to old domain names that are no longer valid. They need to be updated in the deployment-charts repository. A quick review of the deployment-charts files that reference the old kafka names (changeprop and jobqueue) shows this is not the only invalid host in the related config files. I can prep a patch to make all the hostnames valid again, but will likely need @hnowlan to help verify and deploy the updated charts.

ebernhardson@deployment-docker-cpjobqueue01:~$ sudo docker run -it --rm -v cpjobqueue:/srv alpine cat /srv/config.yaml | grep deployment-kafka-main
              - deployment-kafka-main-5.deployment-prep.eqiad.wmflabs:9092
              - deployment-kafka-main-6.deployment-prep.eqiad.wmflabs:9092
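The same check can be applied to any config text to catch leftover references to the old domain. The Python sketch below is only an illustration: the helper `find_stale_hosts` and the regular expression are assumptions, and the third sample line uses a made-up hostname on the `.cloud` domain purely to show a line that should not be flagged.

```python
import re

# Matches hostnames under the old domain, with an optional :port suffix.
OLD_DOMAIN = re.compile(r"[\w.-]+\.eqiad\.wmflabs(?::\d+)?")

SAMPLE_CONFIG = "\n".join([
    "              - deployment-kafka-main-5.deployment-prep.eqiad.wmflabs:9092",
    "              - deployment-kafka-main-6.deployment-prep.eqiad.wmflabs:9092",
    # Hypothetical hostname, used only as an example of a non-matching line.
    "              - example-host.deployment-prep.eqiad1.wikimedia.cloud:9092",
])


def find_stale_hosts(config_text):
    """Return (line_number, host) pairs for every reference to the old domain."""
    hits = []
    for lineno, line in enumerate(config_text.splitlines(), start=1):
        for match in OLD_DOMAIN.finditer(line):
            hits.append((lineno, match.group(0)))
    return hits


if __name__ == "__main__":
    for lineno, host in find_stale_hosts(SAMPLE_CONFIG):
        print(f"line {lineno}: {host}")
```

The pattern requires the literal `.eqiad.wmflabs` suffix, so public names such as commons.wikimedia.beta.wmflabs.org are not flagged.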

Change 790416 had a related patch set uploaded (by Ebernhardson; author: Ebernhardson):

[operations/deployment-charts@master] changeprop: Update beta cluster domain names to .cloud

https://gerrit.wikimedia.org/r/790416

gerritbot added a project: Patch-For-Review. May 9 2022, 7:02 PM

EBernhardson claimed this task. May 9 2022, 7:39 PM

EBernhardson moved this task from Ready for Dev -- SWE to Needs review on the Discovery-Search (Current work) board.

Some relation with T302699, I thought that was also due to changed servers or domains or something? @Majavah @Zabe @dom_walden ?

taavi unsubscribed. May 10 2022, 5:21 AM

Change 790416 merged by jenkins-bot:

[operations/deployment-charts@master] changeprop: Update beta cluster domain names to .cloud

https://gerrit.wikimedia.org/r/790416

Thanks for catching this, I have deployed the config with the fix.

Maintenance_bot removed a project: Patch-For-Review. May 10 2022, 10:30 AM

Change 791070 had a related patch set uploaded (by Ebernhardson; author: Ebernhardson):

[operations/mediawiki-config@master] [Beta Cluster] LabsServices: Move eventgate to new hosts

https://gerrit.wikimedia.org/r/791070

gerritbot added a project: Patch-For-Review. May 11 2022, 7:32 PM

Change 791070 merged by jenkins-bot:

[operations/mediawiki-config@master] [Beta Cluster] LabsServices: Move eventgate to new hosts

https://gerrit.wikimedia.org/r/791070

Search still wasn't updating. Looking into more pieces related to the job queue, I noticed that wmf-config/LabsServices.php also referenced old eventgate hosts on the old domain. Once a fix was deployed, job start/finish messages started showing up in logstash. I ran the CirrusSearch Saneitizer over commonswiki and the example queries now work. I've also started the Saneitizer in beta cluster over the remaining wikis to increase the sanity of their indices; it should finish in maybe 4 hours or so.

In T307862#7915849, @AlexisJazz wrote:

Some relation with T302699, I thought that was also due to changed servers or domains or something? @Majavah @Zabe @dom_walden ?

Seems unlikely to be related. The root cause here looks to be that some hosts were decommissioned and some configurations referencing the old hosts were missed.

EBernhardson removed a subscriber: taavi. May 11 2022, 10:30 PM

Maintenance_bot removed a project: Patch-For-Review. May 11 2022, 11:30 PM

I just uploaded https://commons.wikimedia.beta.wmflabs.org/wiki/File:Jason_Shaw_-_Big_Car_Theft.ogg. MP3 transcode worked. Not showing up in search yet, maybe needs more time.

https://commons.wikimedia.beta.wmflabs.org/wiki/File:Jason_Shaw_-_Big_Car_Theft.ogg?action=cirrusDump shows it being indexed, and it seems to appear in search results now. It might just be that beta is a bit slow to index pages.

AlexisJazz mentioned this in T306758: Media files on betacommons are not transcoding. May 15 2022, 3:11 PM

Gehel closed this task as Resolved. May 16 2022, 2:57 PM
