For reasons yet to be investigated, on certain occasions the master process does not restart some of its workers when they die. We need to find out the circumstances behind it and deal with the issue, as this was one of the factors that led to a recent RESTBase outage.
Description
Description
Event Timeline
Comment Actions
Since this happened in the context of Cassandra issues, one possibility to check would be the restart actually happening as designed, but startup failing due to Cassandra query failures (schema checks during startup). Tell-tale signs of this would be low and uniform CPU time on apparently 'hanging' workers, and logstash full of 'restarting' messages. Once a worker has failed to start up, it will exit. Service-runner should then attempt to restart the worker with a delay of a second or so. With a semi-hanging cassandra cluster it might however take some time to actually fail the startup, which means that restart attempts might be much less frequent than 1/s.