Sometimes worker restarts do not work
Closed, Invalid (Public)

Assigned To

Authored By

	• mobrovac
	Jul 2 2015, 3:35 PM

Description

For reasons yet to be investigated, on certain occasions the master process does not restart some of its workers when they die. We need to find out the circumstances behind it and deal with the issue, as this was one of the factors that led to a recent RESTBase outage.

Event Timeline

• mobrovac created this task. Jul 2 2015, 3:35 PM

• mobrovac raised the priority of this task from to High.

• mobrovac updated the task description.

• mobrovac added projects: service-runner, RESTBase, Services.

• mobrovac subscribed.

Restricted Application added a subscriber: Aklapper. Jul 2 2015, 3:35 PM

Since this happened in the context of Cassandra issues, one possibility to check is that the restart is actually happening as designed, but startup is failing due to Cassandra query failures (schema checks during startup). Tell-tale signs of this would be low and uniform CPU time on apparently 'hanging' workers, and logstash full of 'restarting' messages. Once a worker has failed to start up, it will exit. Service-runner should then attempt to restart the worker with a delay of a second or so. With a semi-hanging Cassandra cluster, however, it might take some time for the startup to actually fail, which means that restart attempts might be much less frequent than 1/s.
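To put rough numbers on that last point, here is a minimal sketch of the arithmetic, not service-runner code. It assumes a single worker slot that cycles through "start, fail, wait, restart", and the function name and the 29-second time-to-fail are made up for illustration; only the roughly one-second restart delay comes from the comment above.

```python
def restart_attempts_per_second(restart_delay_s: float, time_to_fail_s: float) -> float:
    """Approximate restart rate for one worker slot that keeps failing at startup.

    Each cycle consists of the time the worker spends failing its startup
    (time_to_fail_s) plus the delay before the master restarts it
    (restart_delay_s). Both values are assumptions supplied by the caller.
    """
    if restart_delay_s < 0 or time_to_fail_s < 0:
        raise ValueError("durations must be non-negative")
    cycle_s = restart_delay_s + time_to_fail_s
    if cycle_s == 0:
        raise ValueError("cycle length must be positive")
    return 1.0 / cycle_s


# Startup fails immediately: about one attempt per second.
fast_failure = restart_attempts_per_second(1.0, 0.0)

# Hypothetical semi-hanging cluster: startup takes 29 s to fail,
# so there is only one attempt every 30 s.
slow_failure = restart_attempts_per_second(1.0, 29.0)
```

Under these assumptions, a worker that takes a long time to fail would look as if it were hanging, with only occasional 'restarting' messages in the logs, even though the restart logic is behaving as designed.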

• Pchelolo closed this task as Resolved. Jul 20 2015, 3:21 PM

• Pchelolo reopened this task as Open.

• Pchelolo set Security to None.

@mobrovac, @Pchelolo: Is this issue still present?

• GWicke lowered the priority of this task from High to Medium. Jan 27 2016, 11:16 PM

Doesn't seem to be occurring any more.
