Cassandra instance with corrupted commit log after powercycle of restbase1027
Closed, Resolved · Public

Description

Today I power-cycled restbase1027 because it was not responsive (it had failed to publish metrics for days, no mgmt root tty available, etc.). After the power cycle, the a instance (cassandra-a) fails with:

org.apache.cassandra.db.commitlog.CommitLogReadHandler$CommitLogReadException: Mutation checksum failure at 12188413 in Next section at 12188136 in CommitLog-7-1692305823300.log

What I'd do (see the command sketch after this list) is:

  • Move the file to my home dir
  • restart cassandra-a
  • launch nodetool-a repair --full
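
Roughly, something like the following (the systemd unit name and commitlog path are assumptions based on the usual multi-instance layout, so I'd double-check them on the host first):

  # stop the instance if it is still trying to come up (assumed unit name)
  sudo systemctl stop cassandra-a
  # move the corrupt segment out of the way (assumed commitlog path)
  sudo mv /srv/cassandra-a/commitlog/CommitLog-7-1692305823300.log ~/
  sudo systemctl start cassandra-a
  # once the instance is back up and has rejoined the ring
  nodetool-a repair --full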

The host is currently depooled.

Event Timeline

Mentioned in SAL (#wikimedia-operations) [2023-08-28T10:57:57Z] <cgoubert@cumin1001> START - Cookbook sre.hosts.downtime for 5:00:00 on restbase1027.eqiad.wmnet with reason: T345058 - service probes flapping

Mentioned in SAL (#wikimedia-operations) [2023-08-28T10:58:09Z] <cgoubert@cumin1001> END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 5:00:00 on restbase1027.eqiad.wmnet with reason: T345058 - service probes flapping

@Eevans do you think it is a safe plan? If so I'll try to execute it :)

Given what's currently hosted on this cluster, I think we can just delete the corrupt commitlog and restart (and forego any repairs, as well). I'm not at all endorsing this as best practice generally, but in this circumstance, and on this cluster, we should be fine.

I'll go do exactly that now.
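
In other words, roughly (same assumptions as the sketch above about the unit name and commitlog path):

  sudo systemctl stop cassandra-a     # if the instance is not already down
  sudo rm /srv/cassandra-a/commitlog/CommitLog-7-1692305823300.log
  sudo systemctl start cassandra-a
  nodetool-a status                   # confirm the instance comes back Up/Normal (UN)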

Thanks @elukey !

@Eevans ack! When you have a moment, could you add some info about when it is (or isn't) a good idea to start a repair (full or partial)? If there is something on Wikitech already, I'll read it :) I'm asking since this may happen again in the future, and we'll be more confident about what to do if some docs are present (and you are not around). Thanks!

That sounds like a good idea, but I'm going to have to have a think on what that should look like (it's not straightforward).

Spoiler alert though: for good or bad, we're not really set up to be doing repairs at all.

Does it mean that they will not do anything (leaving the status of the node as it is), or that we may cause damage if we run one?

Depending on the cluster, a full repair could take a very long time (so long that it risks failing on something transient). Incremental repairs are what you want, but they will split the compaction pool; the side-effects of that could actually be beneficial, but they could also be detrimental (something we'd want to test for). Either way, they can be very expensive, which might or might not be a problem. They could also result in a (temporary) increase in storage utilization, which shouldn't be an issue if we're managing capacity properly, but might be for restbase, which is currently short on free space.
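
In nodetool terms the difference is just the flag (a sketch, assuming the per-instance nodetool-a wrapper; incremental has been the default invocation in recent Cassandra versions):

  nodetool-a repair --full   # full repair: compares all data, potentially very long-running
  nodetool-a repair          # incremental repair: only un-repaired data; anticompaction splits SSTables into repaired/unrepaired sets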

Doing repairs is really something you're meant to do with some regularity (making any run due to an event like this more... incremental), and it requires infrastructure for coordination, monitoring, etc. We haven't invested in that (yet) because everything we have so far has been secondary data; inconsistencies (which are still exceedingly rare) haven't tended to be something serious, and if push comes to shove we can always regenerate the data.

Eevans claimed this task.

After thinking about this more (read: mashing the edit button and then endlessly vacillating), I'm not sure we're in a position to create (useful) documentation for this (yet). I could certainly write something, but the result should be reasonably concise, actionable, and not require a lot of case-by-case judgement and experience/context. In other words, we need to have a procedure, before we can document that procedure.

I believe that we need to first establish SLOs for each of the clusters. This will inform our tolerances with respect to consistency, data loss, and availability. We can then determine what we need to have in place to meet those requirements (more or less hardware, infrastructure for repair coordination, etc.), and establish (and document) the processes.

We're going to prioritize doing this in the coming quarters. Until we have some progress on that, I'm not certain I know what, in the way of documentation, would be helpful (versus confusing and/or misleading). Of course, if you have any suggestions for something in the interim, please let me know (this might be easier for you than for me, given my proximity to it)!

@Eevans I totally understand your point of view, but at the same time it's not clear to me what procedure we should follow when an issue like this one happens (while on-call, etc.). Is your recommendation to just leave the instance depooled, stop puppet, etc., and then ping Data Persistence for a permanent fix?

... Is your recommendation to just leave the instance depooled, stop puppet, etc., and then ping Data Persistence for a permanent fix?

As unsatisfying as it sounds, I believe this is what has been happening (in practice); I think this is the de facto policy, yes. 😕
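
So, as an interim sketch of what the on-call person would do (the command names here are the usual host wrappers and are assumptions, not a vetted runbook; prefer the documented steps linked below):

  sudo depool                                                                # keep the host out of rotation
  sudo disable-puppet 'T345058: corrupt Cassandra commitlog, investigating'  # assumed reason string
  # ...and then ping Data Persistence with the details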

For this particular case, I went ahead and added the following:

https://wikitech.wikimedia.org/wiki/Cassandra#What_to_do_if...

I don't know how helpful this is, since this was a pretty exceptional event, but, such as it is, it should be actionable and correct (it's what I would do). Perhaps over time we can add enough others that the sum of them becomes... helpful.

No no, for the moment it is fine; what I wanted was to avoid having a single person on call for events like Cassandra being in trouble (namely, you :). We can slowly build knowledge over time; I'm in if you want to evangelize more about how to deal with Cassandra events!