
[Spec] Celery worker monitoring
Closed, Resolved, Public

Description

How would health monitoring of individual workers work? Do we even need it? How would the implementation differ between labs and prod?

  • Do we want a watchdog? Can celeryd handle this for us?
  • Do we want icinga monitoring? How might that look?

Event Timeline

Halfak renamed this task from [Spec] Worker monitoring to [Spec] Celery worker monitoring. Jul 14 2016, 2:33 PM
Halfak triaged this task as Medium priority. Jul 14 2016, 2:53 PM
Halfak updated the task description.
Halfak lowered the priority of this task from Medium to Low. Aug 25 2016, 2:48 PM

I got interested in this question while stress testing on the new cluster. During most of my tests, some of the machines didn't process any jobs, and the workloads were rarely balanced. I installed flower under my user account, pointed it at the broker, and connected through an SSH tunnel. What I learned is that our workers don't respond to any of the management or monitoring commands, such as celery inspect ping. Perhaps this is because we're starting them from within Python, by directly instantiating a Celery application, rather than starting "worker" from the command line.

At a minimum, we should investigate whether starting workers this way degrades any of the built-in worker management features. Either way, making the workers visible to Celery-native monitoring tools seems prudent.
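For reference, here is a rough sketch of how the ping check could be scripted from Python instead of the command line. It is only an illustration: the function names, the worker names in the usage note, and the broker URL argument are hypothetical, and it assumes the celery package is installed wherever ping_workers is called. The ping is the same remote-control command that celery inspect ping sends.

```python
from typing import Iterable, List, Mapping, Optional


def unresponsive_workers(
    ping_replies: Optional[Mapping[str, Mapping[str, str]]],
    expected: Iterable[str],
) -> List[str]:
    """Return the expected worker names that did not answer the ping with 'pong'."""
    # inspect().ping() returns None when no worker replies at all.
    replies = ping_replies or {}
    return sorted(
        name for name in expected
        if replies.get(name, {}).get("ok") != "pong"
    )


def ping_workers(broker_url: str, timeout: float = 2.0):
    """Broadcast a ping over the broker and collect the replies.

    broker_url is a placeholder for our real broker address.
    """
    # Imported lazily so unresponsive_workers() can be used without celery.
    from celery import Celery

    app = Celery(broker=broker_url)
    # Replies look like {'worker@host': {'ok': 'pong'}}, or None if nobody answers.
    return app.control.inspect(timeout=timeout).ping()
```

Usage would be something like unresponsive_workers(ping_workers(broker_url), expected_names), where expected_names is whatever list of worker node names we expect to be up. A non-empty result could drive a watchdog or an icinga check, though whether our workers answer at all is exactly the open question above.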

We used to use flower, but we dropped it once we developed support for graphite monitoring, because flower was a bit of a pain. AFAIK, flower worked fine with the strategy that we're using now to start up workers.

Ladsgroup subscribed.

We have grafana, we have logstash. I call this resolved.

Ladsgroup claimed this task.
Ladsgroup moved this task from Parked to Completed on the Machine-Learning-Team (Active Tasks) board.