
Search on betacommons is not indexing anything
Closed, Resolved · Public · 5 Estimated Story Points · BUG REPORT

Description

List of steps to reproduce (step by step, including full links if applicable):

What happens?:

What should have happened instead?:

If this is another JobQueue thingie, then this, T306758, and T307173 might all have the same root cause.

Event Timeline


CirrusSearch logs in the beta cluster are fairly quiet. Since the job queue was mentioned, I looked into cpjobqueue, the process that distributes jobs to workers. Horizon suggests deployment-docker-cpjobqueue01 is the only host running cpjobqueue. The current logs show the instance is stuck in a crash loop:

ebernhardson@deployment-docker-cpjobqueue01:~$ sudo docker logs --tail 5 c456ed2a0e52 
{"name":"change-propagation","hostname":"c456ed2a0e52","pid":1,"level":"ERROR","message":"worker died, restarting","worker_pid":16900,"exit_code":1,"levelPath":"error/service-runner/master","msg":"worker died, restarting","time":"2022-05-09T18:29:49.800Z","v":0}
{"name":"changeprop","hostname":"c456ed2a0e52","pid":17040,"level":"FATAL","err":{"message":"broker transport failure","name":"Error","stack":"Error: Local: Broker transport failure\n    at Function.createLibrdkafkaError [as create] (/srv/service/node_modules/node-rdkafka/lib/error.js:334:10)\n    at /srv/service/node_modules/node-rdkafka/lib/client.js:339:28","code":-195,"origin":"local","errno":-195},"stack":"Error: Local: Broker transport failure\n    at Function.createLibrdkafkaError [as create] (/srv/service/node_modules/node-rdkafka/lib/error.js:334:10)\n    at /srv/service/node_modules/node-rdkafka/lib/client.js:339:28","levelPath":"fatal/startup","msg":"Message not supplied","time":"2022-05-09T18:30:22.622Z","v":0}
{"name":"change-propagation","hostname":"c456ed2a0e52","pid":1,"level":"ERROR","message":"worker died, restarting","worker_pid":17040,"exit_code":1,"levelPath":"error/service-runner/master","msg":"worker died, restarting","time":"2022-05-09T18:30:24.658Z","v":0}
{"name":"changeprop","hostname":"c456ed2a0e52","pid":17180,"level":"FATAL","err":{"message":"broker transport failure","name":"Error","stack":"Error: Local: Broker transport failure\n    at Function.createLibrdkafkaError [as create] (/srv/service/node_modules/node-rdkafka/lib/error.js:334:10)\n    at /srv/service/node_modules/node-rdkafka/lib/client.js:339:28","code":-195,"origin":"local","errno":-195},"stack":"Error: Local: Broker transport failure\n    at Function.createLibrdkafkaError [as create] (/srv/service/node_modules/node-rdkafka/lib/error.js:334:10)\n    at /srv/service/node_modules/node-rdkafka/lib/client.js:339:28","levelPath":"fatal/startup","msg":"Message not supplied","time":"2022-05-09T18:30:57.508Z","v":0}
{"name":"change-propagation","hostname":"c456ed2a0e52","pid":1,"level":"ERROR","message":"worker died, restarting","worker_pid":17180,"exit_code":1,"levelPath":"error/service-runner/master","msg":"worker died, restarting","time":"2022-05-09T18:30:59.538Z","v":0}
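The crash loop above can be spotted mechanically by tallying the JSON log lines. The following is a minimal sketch, assuming one JSON object per line with the `level`, `message`, `msg`, and `err.message` fields seen in the output above; the function names and the restart threshold are hypothetical and not part of cpjobqueue or service-runner.

```python
import json
from collections import Counter


def summarize_log_lines(lines):
    """Count (level, message) pairs in JSON-per-line service logs.

    Prefers err.message when present, then "message", then "msg".
    Lines that are not JSON objects are skipped.
    """
    counts = Counter()
    for line in lines:
        line = line.strip()
        if not line:
            continue
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            continue  # not JSON, e.g. a shell prompt or stray output
        if not isinstance(record, dict):
            continue
        err = record.get("err")
        err_message = err.get("message") if isinstance(err, dict) else None
        message = err_message or record.get("message") or record.get("msg", "")
        counts[(record.get("level", "?"), message)] += 1
    return counts


def looks_like_crash_loop(counts, threshold=2):
    """Hypothetical heuristic: repeated worker restarts suggest a crash loop."""
    restarts = counts.get(("ERROR", "worker died, restarting"), 0)
    return restarts >= threshold
```

Fed the five lines shown above, this would count three "worker died, restarting" errors and two "broker transport failure" fatals, which points at the Kafka connection rather than at the workers themselves.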

The relevant configuration appears to be in a docker volume, and it refers to old domain names that are no longer valid. They need to be updated in the deployment-charts repository. A quick review of the deployment-charts files that reference the old kafka names (changeprop and jobqueue) shows this is not the only stale host in the related config files. I can prep a patch to make all the hostnames valid again, but will likely need @hnowlan to help verify and deploy the updated charts.

ebernhardson@deployment-docker-cpjobqueue01:~$ sudo docker run -it --rm -v cpjobqueue:/srv alpine cat /srv/config.yaml | grep deployment-kafka-main
              - deployment-kafka-main-5.deployment-prep.eqiad.wmflabs:9092
              - deployment-kafka-main-6.deployment-prep.eqiad.wmflabs:9092
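The same check can be applied to whole config files rather than one grep at a time. This is a rough sketch, assuming that every hostname ending in `.eqiad.wmflabs` is stale, which is inferred from the two broker entries above; `find_stale_hosts` and `OLD_SUFFIX` are hypothetical names, not part of any existing tooling.

```python
import re

# Assumption: hosts under this suffix are the retired ones, based on the
# two broker entries found in config.yaml above.
OLD_SUFFIX = ".eqiad.wmflabs"


def find_stale_hosts(config_text, old_suffix=OLD_SUFFIX):
    """Return (line_number, hostname, port_or_None) for each old-domain host.

    The hostname must end exactly at the suffix, so a name that merely
    contains the suffix followed by more labels is not reported.
    """
    pattern = re.compile(
        r"([A-Za-z0-9][A-Za-z0-9.-]*" + re.escape(old_suffix) + r")"
        r"(?![A-Za-z0-9.-])"   # suffix must end the hostname
        r"(?::(\d+))?"         # optional :port
    )
    found = []
    for lineno, line in enumerate(config_text.splitlines(), start=1):
        for match in pattern.finditer(line):
            found.append((lineno, match.group(1), match.group(2)))
    return found
```

Running this over each changeprop and jobqueue values file would list every remaining reference with its line number, which makes it easier to confirm that a patch caught them all.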

Change 790416 had a related patch set uploaded (by Ebernhardson; author: Ebernhardson):

[operations/deployment-charts@master] changeprop: Update beta cluster domain names to .cloud

https://gerrit.wikimedia.org/r/790416

Some relation with T302699, I thought that was also due to changed servers or domains or something? @Majavah @Zabe @dom_walden ?

Change 790416 merged by jenkins-bot:

[operations/deployment-charts@master] changeprop: Update beta cluster domain names to .cloud

https://gerrit.wikimedia.org/r/790416

Thanks for catching this, I have deployed the config with the fix.

Change 791070 had a related patch set uploaded (by Ebernhardson; author: Ebernhardson):

[operations/mediawiki-config@master] [Beta Cluster] LabsServices: Move eventgate to new hosts

https://gerrit.wikimedia.org/r/791070

Change 791070 merged by jenkins-bot:

[operations/mediawiki-config@master] [Beta Cluster] LabsServices: Move eventgate to new hosts

https://gerrit.wikimedia.org/r/791070

Search still wasn't updating. Looking into more pieces related to the job queue, I noticed that wmf-config/LabsServices.php also references old eventgate hosts with the old domain. Once a fix was deployed, job start/finish messages started showing up in logstash. I ran the CirrusSearch Saneitizer over commonswiki and the example queries now work. I've also started the Saneitizer in the beta cluster over the remaining wikis to increase the sanity of their indices; it should finish in maybe 4 hours or so.

Some relation with T302699, I thought that was also due to changed servers or domains or something? @Majavah @Zabe @dom_walden ?

Seems unlikely to be related. The root cause here looks like some hosts were decommissioned and some configurations referencing the old hosts were missed.

I just uploaded https://commons.wikimedia.beta.wmflabs.org/wiki/File:Jason_Shaw_-_Big_Car_Theft.ogg. MP3 transcode worked. Not showing up in search yet, maybe needs more time.

https://commons.wikimedia.beta.wmflabs.org/wiki/File:Jason_Shaw_-_Big_Car_Theft.ogg?action=cirrusDump shows it being indexed, and it seems to appear in search results now. It might just be that beta is a bit slow to index pages.
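Since indexing on beta can lag behind the upload, the `action=cirrusDump` check can be repeated until the page shows up. The sketch below assumes the dump is a JSON array with one entry per indexed document and an empty array when nothing is indexed yet; that response shape is an assumption, not something confirmed in this task. `fetch` is a hypothetical caller-supplied function that returns the response body for a URL.

```python
import json
import time


def cirrus_dump_url(page_url):
    """Append action=cirrusDump to a page URL."""
    separator = "&" if "?" in page_url else "?"
    return f"{page_url}{separator}action=cirrusDump"


def is_indexed(dump_body):
    """Assumption: a non-empty JSON array means the page is in the index."""
    try:
        docs = json.loads(dump_body)
    except ValueError:
        return False  # not JSON, e.g. an HTML error page
    return isinstance(docs, list) and len(docs) > 0


def wait_until_indexed(fetch, page_url, attempts=5, delay_seconds=60.0,
                       sleep=time.sleep):
    """Poll the dump URL until the page is indexed or attempts run out."""
    url = cirrus_dump_url(page_url)
    for attempt in range(attempts):
        if is_indexed(fetch(url)):
            return True
        if attempt < attempts - 1:
            sleep(delay_seconds)
    return False
```

Passing `fetch` and `sleep` in as parameters keeps the polling logic independent of any particular HTTP client and lets it be exercised without network access.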