[Spec] Celery worker monitoring
Closed, Resolved (Public)

Authored By

	Halfak
	Jul 11 2016, 2:59 PM

Description

How would health monitoring of individual workers work? Do we even need it? How would the implementation differ between labs and prod?

Do we want a watchdog? Can celeryd handle this for us?
Do we want icinga monitoring? How might that look?
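
As one possible answer to the icinga question, here is a minimal sketch of the decision logic for a check. It assumes something else has already counted how many workers are alive. The function name `worker_check`, the thresholds, and the message text are all hypothetical. Only the exit-code convention (0 OK, 1 WARNING, 2 CRITICAL, 3 UNKNOWN) comes from the standard Nagios/Icinga plugin interface.

```python
from typing import Tuple

# Standard Nagios/Icinga plugin exit codes.
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3


def worker_check(alive: int, expected: int) -> Tuple[int, str]:
    """Map a count of responding workers to (exit_code, status_line).

    The thresholds are an assumption for illustration: no workers is
    CRITICAL, fewer than expected is WARNING, otherwise OK.
    """
    if expected <= 0:
        return UNKNOWN, "CELERY UNKNOWN - no expected worker count configured"
    if alive == 0:
        return CRITICAL, f"CELERY CRITICAL - 0/{expected} workers responding"
    if alive < expected:
        return WARNING, f"CELERY WARNING - {alive}/{expected} workers responding"
    return OK, f"CELERY OK - {alive}/{expected} workers responding"
```

A wrapper script would print the status line and exit with the returned code, which is what icinga reads.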

Related Objects

Mentioned In: T139384: Per web node monitoring in prod

Event Timeline

Halfak created this task. Jul 11 2016, 2:59 PM

Restricted Application added subscribers: Zppix, Aklapper. Jul 11 2016, 2:59 PM

Halfak mentioned this in T139384: Per web node monitoring in prod. Jul 11 2016, 2:59 PM

Halfak moved this task from Unsorted to New development on the Machine-Learning-Team board. Jul 14 2016, 2:25 PM

Halfak renamed this task from [Spec] Worker monitoring to [Spec] Celery worker monitoring. Jul 14 2016, 2:33 PM

Halfak triaged this task as Medium priority. Jul 14 2016, 2:53 PM

Halfak updated the task description.

Halfak lowered the priority of this task from Medium to Low. Aug 25 2016, 2:48 PM

I got interested in this question while stress testing the new cluster. During most of my tests, some of the machines wouldn't process any jobs, and the workloads were rarely balanced. I installed flower under my user account, pointed it at the broker, and connected through an SSH tunnel. What I learned is that our workers don't respond to any of the management or monitoring commands, such as "celery inspect ping". Perhaps this is because we start them from within Python, by directly instantiating a Celery application, rather than running "worker" from the command line.

At a minimum, we should investigate whether starting workers this way degrades any of the built-in worker management strategies. Making the workers visible to Celery-native monitoring tools seems prudent either way.
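
To make that investigation concrete, here is a rough sketch of running the same ping from Python instead of the command line. It relies on Celery's `app.control.inspect().ping()`, which returns a dict keyed by worker name (or `None` when nothing answers). The helper names, the broker URL argument, and the list of expected worker names are hypothetical and would need to match the real deployment.

```python
from typing import Dict, Iterable, List, Optional


def missing_workers(replies: Optional[Dict[str, dict]],
                    expected: Iterable[str]) -> List[str]:
    """Return the expected worker names that did not answer a ping.

    `replies` has the shape returned by Celery's inspect().ping():
    a dict such as {"celery@host1": {"ok": "pong"}}, or None when
    no worker answered at all.
    """
    answered = {
        name for name, reply in (replies or {}).items()
        if reply.get("ok") == "pong"
    }
    return sorted(set(expected) - answered)


def ping_cluster(broker_url: str, expected: Iterable[str],
                 timeout: float = 2.0) -> List[str]:
    """Ping every worker on the broker and list the silent ones."""
    # Imported lazily so the pure helper above works without celery installed.
    from celery import Celery

    app = Celery(broker=broker_url)
    replies = app.control.inspect(timeout=timeout).ping()
    return missing_workers(replies, expected)
```

If every expected name comes back as missing while jobs are still being processed, that would match the behaviour described above: the workers are consuming tasks but not listening for remote control commands.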

We used to use flower, but we dropped it once we had developed support for graphite monitoring, because flower was a bit of a pain. AFAIK, flower worked fine with the strategy that we're using now to start up workers.

We have grafana, we have logstash. I call this resolved.

Ladsgroup closed this task as Resolved. Jan 28 2019, 11:13 AM

Ladsgroup claimed this task.

Ladsgroup moved this task from Parked to Completed on the Machine-Learning-Team (Active Tasks) board.

Restricted Application added a project: User-Ladsgroup. Jan 28 2019, 11:13 AM
