Job queue broken on Beta Cluster
Closed, Resolved, Public

Assigned To

Authored By

	MarcoAurelio
	Dec 25 2019, 8:50 PM

Description

Jobs (or at least some of them) are not being executed on the Beta Cluster. logstash-beta shows several cpjobqueue, eventbus and kafka errors, with severities going all the way up to EMERGENCY. For example, no global renames are being processed, and extensions/CentralAuth/maintenance/fixStuckGlobalRename.php can't locate the jobs in the queue either, which suggests that jobs aren't being added to the queues for some reason. See https://deployment.wikipedia.beta.wmflabs.org/wiki/Special:GlobalRenameProgress and the parent task. See also https://logstash-beta.wmflabs.org/goto/8435fc9247afcb5d5647b93803f97a41

Details

	Subject	Repo	Branch	Lines +/-
	deployment-mediawiki-parsoid10: Switch labmon1001 to cloudmetrics1002	cloud/instance-puppet	master	+1 -1

Related Objects

		Status	Subtype	Assigned	Task
		Resolved		None	T241294 Global renames aren't being processed on beta cluster
		Resolved		• Pchelolo	T241448 Job queue broken on Beta Cluster

Event Timeline

MarcoAurelio created this task. Dec 25 2019, 8:50 PM

Restricted Application added a subscriber: Aklapper. Dec 25 2019, 8:50 PM

With an essential feature like the JobQueue broken, the Beta Cluster can't serve its purpose as a testing platform that catches issues before they reach production.

Restricted Application added a subscriber: Liuxinyu970226. Dec 25 2019, 8:51 PM

MarcoAurelio added a parent task: T241294: Global renames aren't being processed on beta cluster. Dec 25 2019, 8:53 PM

MarcoAurelio added a project: Release-Engineering-Team. Dec 25 2019, 8:58 PM

MarcoAurelio updated the task description.

MarcoAurelio added subscribers: • Pchelolo, Ottomata, Krenair.

I think I fixed it. Same problem as seen with beta RESTBase. Root cause: T241263

@Pchelolo Thanks. Looking at the Logstash feed, it looks like it keeps failing with "Error sending hot-shots message: Error: getaddrinfo ENOTFOUND labmon1001.eqiad.wmnet" and "worker died, restarting". I am online now, so please let me know if I can be of any help.
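
The ENOTFOUND in that message means the hostname lookup itself failed. A quick way to confirm this from an affected instance is a minimal sketch like the one below. `can_resolve` is a hypothetical helper written for illustration and is not part of cpjobqueue or any other service named here; it only assumes Python's standard `socket` module.

```python
import socket
from typing import Callable


def can_resolve(hostname: str, resolver: Callable = socket.getaddrinfo) -> bool:
    """Return True if `hostname` resolves, False if the lookup fails.

    `resolver` defaults to socket.getaddrinfo and is injectable only so
    the check can be exercised without touching real DNS.
    """
    try:
        # Port is None because only name resolution matters here.
        resolver(hostname, None)
        return True
    except socket.gaierror:
        # A failed lookup here corresponds to what Node.js reports
        # as "getaddrinfo ENOTFOUND".
        return False
```

Calling `can_resolve("labmon1001.eqiad.wmnet")` on the instance should return False for as long as that name no longer resolves there.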

DannyS712 subscribed. Dec 25 2019, 10:35 PM

Change 560607 had a related patch set uploaded (by MarcoAurelio; owner: MarcoAurelio):
[cloud/instance-puppet@master] deployment-mediawiki-parsoid10: Switch labmon1001 to cloudmetrics1002

https://gerrit.wikimedia.org/r/560607

gerritbot added a project: Patch-For-Review. Dec 26 2019, 11:40 AM

Change 560607 abandoned by MarcoAurelio:
deployment-mediawiki-parsoid10: Switch labmon1001 to cloudmetrics1002

https://gerrit.wikimedia.org/r/560607

Maintenance_bot removed a project: Patch-For-Review. Dec 26 2019, 12:10 PM

Isaacandy subscribed. Dec 28 2019, 7:48 AM

I performed a rename yesterday, after restarting various instances and services and doing other maintenance (cf. T241462). It was a brand new spambot account with no edits, and just three wikis attached. It worked but took ca. 6 minutes to complete. That is certainly not a normal execution time; in production such a rename would take seconds.

I am also seeing messages from deployment-jobrunner03 about jobs starting and finishing.

Despite that, I'm not entirely sure the issue with JobQueue, Kafka, Redis, etc. is really fixed here, so I'd appreciate it if someone familiar with this could take a look.

Thanks.

Release-Engineering-Team doesn't have any expertise in the job queue's inner workings. It'd likely be faster for another team to take a look.

WDoranWMF added a project: Platform Engineering. Jan 8 2020, 5:32 PM

It worked but took ca. 6 minutes to complete.

I don't think it's possible to debug this issue any more, since the logs have probably been rotated away by now.

I'm going to close this ticket. The job queue works now, and it's certainly not "Unbreak Now!" priority anymore. Please open a new one if you keep seeing problems with the job queue in beta.
