Celery workers went down on several nodes recently, and no alarm sounded because the remaining workers could still meet capacity.
We should have a way to determine whether a Celery worker service is running and raise Icinga notifications when it has stopped.
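
One way such a check could look is sketched below as a small Icinga/Nagios-style plugin. This is only an illustration under stated assumptions, not the check that was actually deployed: it assumes the worker runs as a systemd unit named `celery-ores-worker` (the name is taken from the T230917 title), that `systemctl is-active` is available on the host, and the helper names `classify` and `check_unit` are hypothetical.

```python
import subprocess

# Exit codes from the Nagios/Icinga plugin convention.
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3

# Assumption: unit name taken from the T230917 title; adjust as needed.
DEFAULT_UNIT = "celery-ores-worker"


def classify(unit, state):
    """Map the state printed by `systemctl is-active` to (exit_code, message)."""
    if state == "active":
        return OK, f"OK: {unit} is active"
    if state in ("inactive", "failed", "deactivating"):
        return CRITICAL, f"CRITICAL: {unit} is {state}"
    # Transitional or unexpected states (e.g. "activating", empty output).
    return UNKNOWN, f"UNKNOWN: {unit} reported state {state!r}"


def check_unit(unit=DEFAULT_UNIT, runner=subprocess.run):
    """Ask systemd for the unit state and return (exit_code, message).

    `runner` is injectable so the logic can be exercised without systemd.
    """
    try:
        result = runner(
            ["systemctl", "is-active", unit],
            capture_output=True,
            text=True,
            check=False,  # is-active exits non-zero for stopped units
        )
    except OSError as exc:
        return UNKNOWN, f"UNKNOWN: could not run systemctl: {exc}"
    return classify(unit, result.stdout.strip())
```

To use this as a plugin, a wrapper script would print the message and exit with the returned code, for example `code, msg = check_unit(); print(msg); sys.exit(code)`. Checking the unit on each node, rather than overall capacity, is what would surface a stopped worker even while the remaining nodes keep up with load.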
Status | Assigned | Task
---|---|---
Resolved | Joe | T230917 celery-ores-worker service failed on ores100[2,4,5] without any apparent reason or significant log
Resolved | Halfak | T230931 Monitor ORES celery worker status