
Cassandra upgrades in staging attempted to start root instance
Closed, Resolved (Public)

Description

During the staging upgrade to 2.1.13, the package's post-install script invoked the sysv init script to start the root instance (the one based out of /var/lib/cassandra and /etc/cassandra). On several of the nodes, it actually succeeded. This could be Very Bad if it were to happen in production, particularly if it went unnoticed and the aberrant instance were to bootstrap.

Of the nodes that failed (meaning, where the aberrant instance did not start up)...

restbase-test2001.codfw.wmnet didn't start one due to a missing cassandra.yaml:

eevans@restbase-test2001:~$ bash -x /etc/init.d/cassandra status
+ DESC=Cassandra
+ NAME=cassandra
+ PIDFILE=/var/run/cassandra/cassandra.pid
+ SCRIPTNAME=/etc/init.d/cassandra
+ CONFDIR=/etc/cassandra
+ WAIT_FOR_START=10
+ CASSANDRA_HOME=/usr/share/cassandra
+ FD_LIMIT=100000
+ '[' -e /usr/share/cassandra/apache-cassandra.jar ']'
+ '[' -e /etc/cassandra/cassandra.yaml ']'
+ exit 0

Three others failed only because the data under /var/lib/cassandra predates the cluster rename from "Test Cluster" to "services-test":

ERROR [main] 2016-02-18 18:04:23,351 CassandraDaemon.java:294 - Fatal exception during initialization
org.apache.cassandra.exceptions.ConfigurationException: Saved cluster name Test Cluster != configured name services-test
        at org.apache.cassandra.db.SystemKeyspace.checkHealth(SystemKeyspace.java:613) ~[apache-cassandra-2.1.13.jar:2.1.13]
        at org.apache.cassandra.service.CassandraDaemon.setup(CassandraDaemon.java:290) [apache-cassandra-2.1.13.jar:2.1.13]
        at org.apache.cassandra.service.CassandraDaemon.activate(CassandraDaemon.java:564) [apache-cassandra-2.1.13.jar:2.1.13]
        at org.apache.cassandra.service.CassandraDaemon.main(CassandraDaemon.java:653) [apache-cassandra-2.1.13.jar:2.1.13]
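
For the first failure, the `bash -x` trace suggests the packaged initscript tests for the jar and for cassandra.yaml, then exits 0 quietly when the latter is missing. Below is a minimal sketch of that guard, reconstructed from the trace rather than copied from the packaged script. The function name `root_instance_can_start` is hypothetical, and treating the jar test the same way as the cassandra.yaml test is an assumption.

```shell
#!/bin/sh
# Sketch only: reconstructed from the bash -x trace, not the packaged script.
# root_instance_can_start is a hypothetical helper name.
root_instance_can_start() {
    # Defaults mirror CASSANDRA_HOME and CONFDIR as shown in the trace.
    cassandra_home="${1:-/usr/share/cassandra}"
    confdir="${2:-/etc/cassandra}"

    # Assumption: a missing jar short-circuits the same way as a missing yaml.
    [ -e "$cassandra_home/apache-cassandra.jar" ] || return 1

    # The trace shows "exit 0" immediately after this test on the failing node.
    [ -e "$confdir/cassandra.yaml" ] || return 1

    return 0
}
```

Calling `root_instance_can_start` with no arguments on a node gives a rough indication of whether the root instance would get past these two checks. It says nothing about whether the instance would then survive startup, as the cluster-name failure shows.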

What is not clear to me is why this hasn't been an issue before.


And obviously, going forward we need a concrete (non-accidental) way of disabling these root instances.

Event Timeline

Change 272612 had a related patch set uploaded (by Eevans):
disable package-installed initscript

https://gerrit.wikimedia.org/r/272612

Feels like a bit of a kludge, but I submitted https://gerrit.wikimedia.org/r/#/c/272612/, which overwrites /etc/init.d/cassandra with a no-op script. To the best of my knowledge, we're now fully using systemd units (put in place as part of the multi-instance changeset), and I can't think of any scenario where it would be acceptable to run the package-installed initscript.
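
To illustrate the idea (this is not the content of the Gerrit change), a no-op replacement could look roughly like the sketch below. The helper name `write_noop_initscript` and the message text are hypothetical. The generated script ignores its arguments, prints a notice to stderr, and exits 0, so a post-install `start` becomes harmless.

```shell
#!/bin/sh
# Sketch only: a hypothetical helper that writes a no-op initscript to the
# given path. It is not the script installed by the actual change.
write_noop_initscript() {
    target="$1"

    # Quoted heredoc delimiter: the body is written verbatim, unexpanded.
    cat > "$target" <<'EOF'
#!/bin/sh
# No-op placeholder for the package-installed Cassandra initscript.
# Every action (start, stop, status, ...) does nothing and succeeds.
echo "cassandra: package initscript disabled; use the systemd units" >&2
exit 0
EOF

    # Keep it executable so callers that invoke it directly do not error out.
    chmod 0755 "$target"
}
```

On a node this would be invoked as `write_noop_initscript /etc/init.d/cassandra` (as root). One caveat of exiting 0 unconditionally is that `status` also reports success, which init tooling may read as "running".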

Change 272612 merged by Filippo Giunchedi:
disable package-installed initscript

https://gerrit.wikimedia.org/r/272612

This has been deployed to all nodes; resolving.