
Sometimes worker restarts do not work
Closed, Invalid (Public)

Description

For reasons yet to be investigated, on certain occasions the master process does not restart some of its workers when they die. We need to find out the circumstances behind it and deal with the issue, as this was one of the factors that led to a recent RESTBase outage.

Event Timeline

mobrovac raised the priority of this task to High.
mobrovac updated the task description.
mobrovac added a subscriber: mobrovac.

Since this happened in the context of Cassandra issues, one possibility to check is that the restart is actually happening as designed, but startup is failing due to Cassandra query failures (schema checks during startup). Tell-tale signs of this would be low and uniform CPU time on apparently 'hanging' workers, and logstash full of 'restarting' messages. Once a worker has failed to start up, it will exit. Service-runner should then attempt to restart the worker after a delay of a second or so. With a semi-hanging Cassandra cluster, however, it might take some time for the startup to actually fail, which means that restart attempts might be much less frequent than one per second.
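
To make the last point concrete, here is a rough back-of-the-envelope sketch of how the restart frequency drops when each startup attempt takes a while to fail. This is not service-runner code; the function name and both parameters are hypothetical, and it assumes a simple cycle of "wait for the restart delay, then spend some time failing the startup".

```typescript
// Hypothetical model, NOT part of service-runner: estimates how often restart
// attempts happen when every startup attempt eventually fails.
//
// restartDelayMs   - assumed delay before the master re-forks a dead worker
// startupFailureMs - assumed time a worker spends in startup before it fails
//                    (e.g. waiting on slow Cassandra schema checks)
function restartAttemptsPerSecond(
  restartDelayMs: number,
  startupFailureMs: number
): number {
  if (restartDelayMs < 0 || startupFailureMs < 0) {
    throw new RangeError("durations must be non-negative");
  }
  // One full cycle: wait for the restart delay, then fail the startup.
  const cycleMs = restartDelayMs + startupFailureMs;
  if (cycleMs === 0) {
    return Infinity; // degenerate case: no delay and instant failure
  }
  return 1000 / cycleMs;
}
```

Under these assumptions, a 1 s restart delay with an instantly failing startup gives one attempt per second, while the same delay with a startup that takes 9 s to fail gives only one attempt every 10 s. A worker in that state would look like it is 'hanging' even though restarts are happening as designed.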

Pchelolo reopened this task as Open.
Pchelolo set Security to None.
GWicke lowered the priority of this task from High to Medium. Jan 27 2016, 11:16 PM
mobrovac claimed this task.

Doesn't seem to be occurring any more.