
Job queue broken on Beta Cluster
Closed, Resolved (Public)

Description

Jobs (or at least some of them) are not being executed on the Beta Cluster. logstash-beta shows several cpjobqueue, eventbus and kafka errors, with severities going all the way up to EMERGENCY. For example, no global renames are being processed, and extensions/CentralAuth/maintenance/fixStuckGlobalRename.php can't locate the jobs in the queue either, which suggests that jobs aren't being added to the queues for some reason. See https://deployment.wikipedia.beta.wmflabs.org/wiki/Special:GlobalRenameProgress and the parent task. See also https://logstash-beta.wmflabs.org/goto/8435fc9247afcb5d5647b93803f97a41

Event Timeline

MarcoAurelio triaged this task as Unbreak Now! priority. Dec 25 2019, 8:51 PM

With an essential feature like the JobQueue broken, Beta Cluster cannot serve its purpose as a testing platform for catching issues before they reach production.

I think I fixed it. Same problem as seen with beta RESTBase. Root cause: T241263

@Pchelolo Thanks. Looking at the Logstash feed, it looks like it keeps failing with "Error sending hot-shots message: Error: getaddrinfo ENOTFOUND labmon1001.eqiad.wmnet" and "worker died, restarting". I am online now, so please let me know if I can be of any help.
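The ENOTFOUND error quoted above means the metrics hostname no longer resolves in DNS. A minimal sketch for checking that from an affected instance, assuming Python is available there; the helper names `resolves` and `unresolvable` are hypothetical and not part of any service mentioned in this task:

```python
import socket


def resolves(hostname: str) -> bool:
    """Return True if the hostname resolves to at least one address."""
    try:
        # getaddrinfo raises socket.gaierror when the lookup fails,
        # which is the Python counterpart of Node's ENOTFOUND.
        return len(socket.getaddrinfo(hostname, None)) > 0
    except socket.gaierror:
        return False


def unresolvable(hostnames):
    """Return the subset of hostnames that fail to resolve, in order."""
    return [name for name in hostnames if not resolves(name)]


if __name__ == "__main__":
    # Hostname taken from the log message quoted above.
    for name in unresolvable(["labmon1001.eqiad.wmnet"]):
        print(f"cannot resolve: {name}")
```

If the script reports the hostname, any service configuration still pointing at it needs updating to a host that resolves.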

Change 560607 had a related patch set uploaded (by MarcoAurelio; owner: MarcoAurelio):
[cloud/instance-puppet@master] deployment-mediawiki-parsoid10: Switch labmon1001 to cloudmetrics1002

https://gerrit.wikimedia.org/r/560607

Change 560607 abandoned by MarcoAurelio:
deployment-mediawiki-parsoid10: Switch labmon1001 to cloudmetrics1002

https://gerrit.wikimedia.org/r/560607

I performed a rename yesterday after various instance and service restarts and other maintenance (cf. T241462). It was a brand new spambot account with no edits and just three wikis attached. It worked but took ca. 6 minutes to complete. That is certainly not a normal execution time compared with production, where such renames would take seconds to complete.

I am also seeing job start/finish messages from deployment-jobrunner03.

Despite that, I'm not entirely sure the issue with JobQueue, Kafka, Redis, etc. is really fixed here, so I'd appreciate it if someone familiar with this could take a look.

Thanks.

thcipriani subscribed.

Release-Engineering-Team doesn't have any expertise in the jobqueue's inner workings. It'd likely be faster for another team to take a look.

Pchelolo claimed this task.

"It worked but took ca. 6 minutes to complete."

I don't think it is possible to debug this issue any more, since the logs have probably been rotated away by now.

I'm going to close this ticket. The job queue works now, and it's certainly not Unbreak Now! priority anymore. Please open a new one if you keep seeing problems with the job queue in beta.