cassandra-a instance on aqs1007 is not starting
Closed, Resolved (Public)

Description

It is failing to start with:

ERROR [main] 2018-08-15 08:08:39,089 JVMStabilityInspector.java:78 - Exiting due to error while processing commit log during initialization.
org.apache.cassandra.db.commitlog.CommitLogReplayer$CommitLogReplayException: Could not read commit log descriptor in file /srv/cassandra-a/commitlog/CommitLog-5-1530620590775.log

Removing this commit log might allow the instance to start, after which a nodetool repair might recover the data. However, this action requires validation by someone who understands Cassandra.

Event Timeline

Mentioned in SAL (#wikimedia-operations) [2018-08-15T08:14:14Z] <gehel> masking cassandra-a instance on aqs1007 since it is flapping - T201986

ema triaged this task as Medium priority. Aug 15 2018, 8:16 AM

It looks like the host has only been up for ~6 hours, and cassandra-a never actually managed to start.

root@aqs1007:~# uptime ; date ; journalctl -u cassandra-a.service | head
 08:20:20 up  6:39,  1 user,  load average: 1.22, 1.69, 2.13
Wed Aug 15 08:20:20 UTC 2018
-- Logs begin at Wed 2018-08-15 01:40:36 UTC, end at Wed 2018-08-15 08:20:15 UTC. --
Aug 15 01:41:20 aqs1007 systemd[1]: Started distributed storage system for structured data.
Aug 15 01:43:33 aqs1007 systemd[1]: cassandra-a.service: Main process exited, code=exited, status=100/n/a
Aug 15 01:43:33 aqs1007 systemd[1]: cassandra-a.service: Unit entered failed state.
Aug 15 01:43:33 aqs1007 systemd[1]: cassandra-a.service: Failed with result 'exit-code'.
Aug 15 02:07:18 aqs1007 systemd[1]: Started distributed storage system for structured data.
Aug 15 02:09:23 aqs1007 systemd[1]: cassandra-a.service: Main process exited, code=exited, status=100/n/a
Aug 15 02:09:23 aqs1007 systemd[1]: cassandra-a.service: Unit entered failed state.
Aug 15 02:09:23 aqs1007 systemd[1]: cassandra-a.service: Failed with result 'exit-code'.
Aug 15 02:36:37 aqs1007 systemd[1]: Started distributed storage system for structured data.
ema claimed this task.
ema added a subscriber: Joe.

@Joe removed the log and restarted cassandra-a. The service now seems to be working fine.

05:12 _joe_: moving away corrupted commitlog file on aqs1007 cassandra-a instance, trying to restart it

For the record: I removed the file (still on disk at /srv/cassandra-a/commitlog/CommitLog-5-1530620590775.log.bak) once I noticed it was all zeroes.

Since there was no real information there, I preferred to try to restore the service.
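For anyone hitting the same error later, the "all zeroes" check can be sketched roughly as below. This is only an illustration, not the command that was actually run on aqs1007; the function name is_all_zeroes and the 1 MiB chunk size are assumptions made for the example.

```python
def is_all_zeroes(path, chunk_size=1 << 20):
    """Return True if every byte in the file is 0x00.

    Hypothetical helper: reads the file in chunks so that a large
    commitlog segment does not have to fit in memory. An empty file
    is reported as all zeroes.
    """
    with open(path, "rb") as f:
        while True:
            chunk = f.read(chunk_size)
            if not chunk:
                # Reached end of file without seeing a non-zero byte.
                return True
            if chunk.count(0) != len(chunk):
                # At least one byte in this chunk is not 0x00.
                return False
```

Calling is_all_zeroes() on the renamed segment (the .log.bak path above) would be one way to confirm that the file holds no recoverable data before discarding it.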

Just for posterity's sake: I don't know why the log would have been corrupted like this (almost certainly a bug), but the commitlog only exists to append incoming writes until what was buffered in memory can be flushed to storage. Since AQS replicates data 3 ways, only appends new values, and reads at quorum, there is zero risk of losing data by deleting a commitlog segment in a situation like this.
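The quorum argument can be checked with a small sketch. It assumes the usual majority definition of quorum (replication_factor // 2 + 1) and that the two other replicas did receive the writes in question; the function names are made up for illustration and the code does not talk to Cassandra at all.

```python
from itertools import combinations


def quorum(replication_factor):
    """Majority quorum for a given replication factor."""
    return replication_factor // 2 + 1


def every_quorum_sees_intact_replica(replication_factor, damaged):
    """Return True if every possible quorum-sized set of replicas
    contains at least one replica that is not in `damaged`.

    `damaged` is a set of replica indexes (0-based) that lost their
    unflushed writes, e.g. by having a commitlog segment removed.
    """
    replicas = range(replication_factor)
    size = quorum(replication_factor)
    return all(
        any(r not in damaged for r in group)
        for group in combinations(replicas, size)
    )
```

With a replication factor of 3, quorum is 2, so every_quorum_sees_intact_replica(3, {0}) holds: whichever two replicas a read contacts, at most one of them is the instance that lost its commitlog. The same check fails for two damaged replicas, which is why the reasoning only covers a single affected instance.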