How would health monitoring of individual workers work? Do we even need it? How would the implementation differ between labs and prod?
- Do we want a watchdog? Can celeryd handle this for us?
- Do we want icinga monitoring? How might that look?
I got interested in this question while stress testing the new cluster. During most of my tests, some of the machines didn't process any jobs, and the workloads were rarely balanced. I installed flower under my user account, pointed it at the broker, and connected through an SSH tunnel. What I learned is that our workers don't respond to any of the management or monitoring commands, such as celery inspect ping. Perhaps this is because we start them from within Python, by directly instantiating a Celery application, rather than running "worker" from the command line.
At a minimum, we should investigate whether starting workers this way degrades any of the built-in worker management strategies. Either way, making the workers visible to Celery-native monitoring tools seems prudent.
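For anyone who wants to reproduce the check without flower, here is a minimal sketch of the same ping done from Python. It assumes Celery is installed and that the broker URL is reachable from wherever this runs. The function names ping_workers and silent_workers, and the worker names in the usage note, are made up for illustration and are not part of our code.

```python
def silent_workers(expected, replies):
    """Return the expected worker names that did not answer a ping.

    `replies` is the list returned by Celery's control.ping(), i.e. a list
    of {hostname: {"ok": "pong"}} dicts. None or an empty list means that
    nobody answered.
    """
    responded = set()
    for reply in replies or []:
        for name, payload in reply.items():
            if isinstance(payload, dict) and payload.get("ok") == "pong":
                responded.add(name)
    return sorted(set(expected) - responded)


def ping_workers(broker_url, timeout=2.0):
    """Broadcast a ping over the broker and collect whatever replies arrive.

    Celery is imported lazily so the helper above can be used (and tested)
    on a machine without Celery installed.
    """
    from celery import Celery

    app = Celery(broker=broker_url)
    return app.control.ping(timeout=timeout)
```

With a list of the worker node names we expect (hypothetical here, e.g. "celery@worker01"), silent_workers(expected, ping_workers(broker_url)) would list the ones that stayed quiet. If every worker shows up as silent, that matches what I saw with celery inspect ping and points at remote control not working for workers started this way, rather than at individual machines.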
We used to use flower, but we dropped it when we developed support for graphite monitoring because flower was a bit of a pain. AFAIK, flower worked fine with the strategy that we're using now to start up workers.