
service cassandra-b fails on restbase2004
Closed, Resolved (Public)

Description

04:37 < icinga-wm> RECOVERY - puppet last run on restbase2004 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
11:35 < icinga-wm> RECOVERY - cassandra-b service on restbase2004 is OK: OK - cassandra-b is active
11:41 < icinga-wm> PROBLEM - cassandra-b service on restbase2004 is CRITICAL: CRITICAL - Expecting active but unit cassandra-b is failed
14:06 < icinga-wm> RECOVERY - cassandra-b service on restbase2004 is OK: OK - cassandra-b is active
14:11 < icinga-wm> PROBLEM - cassandra-b service on restbase2004 is CRITICAL: CRITICAL - Expecting active but unit cassandra-b is failed
16:36 < icinga-wm> RECOVERY - cassandra-b service on restbase2004 is OK: OK - cassandra-b is active
16:42 < icinga-wm> PROBLEM - cassandra-b service on restbase2004 is CRITICAL: CRITICAL - Expecting active but unit cassandra-b is failed
17:06 < icinga-wm> RECOVERY - cassandra-b service on restbase2004 is OK: OK - cassandra-b is active
17:10 < icinga-wm> PROBLEM - cassandra-b service on restbase2004 is CRITICAL: CRITICAL - Expecting active but unit cassandra-b is failed
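
To put a rough number on the pattern in the log above, here is a minimal Python sketch (not part of the original report) that measures how long the service stayed "OK" between each RECOVERY and the following PROBLEM. The function name `uptime_minutes` and the regular expression are illustrative assumptions. The timestamps are when the bot posted, so the results are only approximate, and the sketch assumes all lines fall on the same day.

```python
import re

# Alert lines copied from the task description (cassandra-b checks only).
LOG = """\
11:35 < icinga-wm> RECOVERY - cassandra-b service on restbase2004 is OK: OK - cassandra-b is active
11:41 < icinga-wm> PROBLEM - cassandra-b service on restbase2004 is CRITICAL: CRITICAL - Expecting active but unit cassandra-b is failed
14:06 < icinga-wm> RECOVERY - cassandra-b service on restbase2004 is OK: OK - cassandra-b is active
14:11 < icinga-wm> PROBLEM - cassandra-b service on restbase2004 is CRITICAL: CRITICAL - Expecting active but unit cassandra-b is failed
16:36 < icinga-wm> RECOVERY - cassandra-b service on restbase2004 is OK: OK - cassandra-b is active
16:42 < icinga-wm> PROBLEM - cassandra-b service on restbase2004 is CRITICAL: CRITICAL - Expecting active but unit cassandra-b is failed
17:06 < icinga-wm> RECOVERY - cassandra-b service on restbase2004 is OK: OK - cassandra-b is active
17:10 < icinga-wm> PROBLEM - cassandra-b service on restbase2004 is CRITICAL: CRITICAL - Expecting active but unit cassandra-b is failed
"""

# Hypothetical pattern: HH:MM, the bot name, then the alert type for this check.
LINE_RE = re.compile(
    r"^(\d{2}):(\d{2}) < icinga-wm> (RECOVERY|PROBLEM) - cassandra-b service"
)


def uptime_minutes(log):
    """Return the minutes between each RECOVERY and the next PROBLEM."""
    durations = []
    recovered_at = None
    for line in log.splitlines():
        match = LINE_RE.match(line)
        if not match:
            continue  # ignore unrelated lines
        minute_of_day = int(match.group(1)) * 60 + int(match.group(2))
        if match.group(3) == "RECOVERY":
            recovered_at = minute_of_day
        elif recovered_at is not None:
            durations.append(minute_of_day - recovered_at)
            recovered_at = None
    return durations


print(uptime_minutes(LOG))
```

Each recovery in this excerpt is followed by a failure within a few minutes, which is what one would expect if something keeps starting the unit and startup then fails, rather than the check itself being noisy.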

Event Timeline

We have had these messages in the channel for many hours; the service keeps crashing and then coming back. Did nobody get pages or mails?

[restbase2004:~] $ sudo -s
root@restbase2004:~# service cassandra-b status
● cassandra-b.service - distributed storage system for structured data
   Loaded: loaded (/lib/systemd/system/cassandra-b.service; static)
   Active: failed (Result: exit-code) since Tue 2016-04-19 01:31:20 UTC; 2min 16s ago
  Process: 11732 ExecStart=/usr/sbin/cassandra -p /var/run/cassandra/cassandra-b.pid (code=exited, status=3)
 Main PID: 11732 (code=exited, status=3)

Apr 19 01:31:18 restbase2004 cassandra[11544]: at org.apache.cassandra.service.StorageService.prepareToJoin(StorageService.java:794)
Apr 19 01:31:18 restbase2004 cassandra[11544]: at org.apache.cassandra.service.StorageService.initServer(StorageService.java:726)
Apr 19 01:31:18 restbase2004 cassandra[11544]: at org.apache.cassandra.service.StorageService.initServer(StorageService.java:617)
Apr 19 01:31:18 restbase2004 cassandra[11544]: at org.apache.cassandra.service.CassandraDaemon.setup(CassandraDaemon.java:389)
Apr 19 01:31:18 restbase2004 cassandra[11544]: at org.apache.cassandra.service.CassandraDaemon.activate(CassandraDaemon.java:564)
Apr 19 01:31:18 restbase2004 cassandra[11544]: at org.apache.cassandra.service.CassandraDaemon.main(CassandraDaemon.java:653)
Apr 19 01:31:18 restbase2004 cassandra[11544]: Exception encountered during startup: Other bootstrapping/leaving/moving nodes detected, cannot bootstrap w...t is true
Apr 19 01:31:18 restbase2004 cassandra[11544]: WARN  01:31:18 No local state or state is in silent shutdown, not announcing shutdown
Apr 19 01:31:20 restbase2004 systemd[1]: cassandra-b.service: main process exited, code=exited, status=3/NOTIMPLEMENTED
Apr 19 01:31:20 restbase2004 systemd[1]: Unit cassandra-b.service entered failed state.
Hint: Some lines were ellipsized, use -l to show in full.
root@restbase2004:~# systemctl cassandra-b start
Unknown operation 'cassandra-b'.
root@restbase2004:~# systemctl start cassandra-b
root@restbase2004:~# systemctl status cassandra-b
● cassandra-b.service - distributed storage system for structured data
   Loaded: loaded (/lib/systemd/system/cassandra-b.service; static)
   Active: active (running) since Tue 2016-04-19 01:34:21 UTC; 5s ago
 Main PID: 13726 (java)
   CGroup: /system.slice/cassandra-b.service
           └─13726 java -ea -javaagent:/usr/share/cassandra/lib/jamm-0.3.0.jar -XX:+CMSClassUnloadingEnabled -XX:+UseThreadPriorities -XX:ThreadPriorityPolicy=42 -...

Dzahn renamed this task from service cassandra-b failed on restbase2004 to service cassandra-b fails on restbase2004. Apr 19 2016, 1:38 AM
Dzahn edited projects, added RESTBase-Cassandra, SRE; removed RESTBase.

This node should not be running; it is administratively down. I'm not sure why it started sending notifications now.

fgiunchedi triaged this task as Medium priority. Apr 27 2016, 3:07 PM
Eevans claimed this task.

I think this must have just been an Icinga snafu: the overoptimistic use of an expiring acknowledgement, or some such. The instance in question is administratively down until after new hardware/capacity is added to rack 'c' in codfw, and it seems to have a persistent acknowledgement now.

I'm resolving; feel free to reopen if I missed something.

Fine with me, but I fail to see how "cassandra[11544]: Exception encountered during startup: " can be an Icinga snafu.