
Perform initial (manual) repair of Cassandra cluster
Closed, Resolved · Public

Description

Cassandra leverages a number of mechanisms in order to converge on a consistent state (read-repair, hinted-handoff, etc). However, it is still necessary to periodically issue repairs to keep any missed inconsistencies from accumulating.

Given that repairs have never really been performed on this cluster, and given the known history of (possibly widespread) inconsistent deletes, we can expect the initial repair to be quite extensive. The advantage of starting a repair sooner rather than later is that the current state of the three new nodes (restbase100[7-9].eqiad) is likely quite close to what it should be (they have only recently concluded bootstrapping). Additionally, the overall size of the cluster dataset is currently low, and can only be expected to get larger.

Prior to enabling automated, recurring anti-entropy repairs (T92355), we should begin with one or more closely monitored manual invocations.

Related: T92355


| node | status | started | completed | comments |
|---|---|---|---|---|
| restbase1001 | complete | 2015-09-09T13:05:39+0000 | 2015-09-12T06:56:37+0000 | rack A |
| restbase1002 | complete | 2015-09-12T19:22:11+0000 | 2015-09-16T11:31:25+0000 | rack A |
| restbase1003 | complete | 2015-09-14T12:51:25+0000 | 2015-09-19T20:25:40+0000 | rack B |
| restbase1004 | interrupted | 2015-09-19T23:08:40+0000 | 2015-09-21T12:53:00+0000 | rack B |
| restbase1005 | interrupted | 2015-09-19T23:12:15+0000 | 2015-09-21T12:53:00+0000 | rack D |
| restbase1006 | in-progress | 2015-09-25T14:06:09+0000 | | rack D; nodetool repair -dc eqiad |
| restbase1007 | interrupted | 2015-09-16T13:40:49+0000 | 2015-09-21T12:53:00+0000 | rack A |
| restbase1008 | | | | |
| restbase1009 | | | | |

Event Timeline

Eevans created this task. Aug 10 2015, 6:32 PM
Eevans claimed this task.
Eevans raised the priority of this task to Medium.
Eevans updated the task description.
Eevans added a project: RESTBase-Cassandra.
Eevans added a subscriber: Eevans.
Restricted Application added a subscriber: Aklapper. Aug 10 2015, 6:32 PM
Eevans renamed this task from "perform initial repair of Cassandra cluster" to "perform initial (manual) repair of Cassandra cluster". Aug 10 2015, 6:33 PM
Eevans updated the task description.
Eevans set Security to None.

It'd be great to have this done prior to adding the new codfw nodes (T108953); if there are no objections, I'll begin issuing manual repairs tomorrow (nodetool repair -pr), working through the cluster one node at a time.
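
As a rough illustration of that plan (the repairs in this task were actually kicked off by hand), here is a minimal sketch that walks an assumed node list and runs the repairs strictly one node at a time; the node list and the use of ssh are assumptions, not how the work was done here:

```python
#!/usr/bin/env python
# Hypothetical sketch only: walk an assumed list of eqiad nodes and run a
# primary-range repair on each, one node at a time, stopping on failure.
import subprocess
import sys

NODES = [
    "restbase1001.eqiad.wmnet",
    "restbase1002.eqiad.wmnet",
    # ... remaining eqiad nodes
]

for node in NODES:
    print("Starting `nodetool repair -pr` on %s" % node)
    rc = subprocess.call(["ssh", node, "nodetool", "repair", "-pr"])
    if rc != 0:
        # Stop and investigate rather than queueing further sessions.
        sys.exit("repair failed on %s (exit code %d)" % (node, rc))
```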

Eevans added a comment. Sep 9 2015, 1:06 PM

Repair of restbase1001 has started.

Eevans updated the task description. Sep 9 2015, 1:11 PM
Eevans updated the task description. Sep 10 2015, 7:26 PM
GWicke added a subscriber: GWicke. Edited Sep 10 2015, 8:45 PM

Given the high costs for a full repair at these data sizes, it might be worth considering incremental repairs. There have been some bugs related to those early in the 2.1 cycle, but things might have improved since.

> Given the high costs for a full repair at these data sizes, it might be worth considering incremental repairs. There have been some bugs related to those early in the 2.1 cycle, but things might have improved since.

Yes, I'd like to do exactly that, but have opted for the simplest kind of repair in order to get this first one out of the way without complication (ideally before we add the codfw nodes). Since we've never done repairs (and given all of the recent revision culling), the delta is presumably quite large (which I assume explains the long running nature of this first one).

The good news is that this doesn't appear to have much noticeable impact on node performance (we currently have plenty of headroom).

Eevans updated the task description. Sep 12 2015, 7:22 PM

Started repair of restbase1002.

Eevans updated the task description. Sep 14 2015, 12:52 PM

Started repair of restbase1003.

The repair of restbase1002 is complete.

The nodetool output indicated an exception:

error: nodetool failed, check server logs
-- StackTrace --
java.lang.RuntimeException: nodetool failed, check server logs
        at org.apache.cassandra.tools.NodeTool$NodeToolCmd.run(NodeTool.java:290)
        at org.apache.cassandra.tools.NodeTool.main(NodeTool.java:202)

The lone RepairException from the logs:

ERROR [Thread-58469] 2015-09-16 02:54:56,301 StorageService.java:2959 - Repair session b6b14860-5b93-11e5-a975-97f310c5c22c for range (-765749035276761727,-757983545592020469] failed with error org.apache.cassandra.exceptions.RepairException: [repair #b6b14860-5b93-11e5-a975-97f310c5c22c on local_group_wikipedia_T_parsoid_dataW4ULtxs1oMqJ/data, (-765749035276761727,-757983545592020469]] Validation failed in /10.64.32.178
java.util.concurrent.ExecutionException: java.lang.RuntimeException: org.apache.cassandra.exceptions.RepairException: [repair #b6b14860-5b93-11e5-a975-97f310c5c22c on local_group_wikipedia_T_parsoid_dataW4ULtxs1oMqJ/data, (-765749035276761727,-757983545592020469]] Validation failed in /10.64.32.178
        at java.util.concurrent.FutureTask.report(FutureTask.java:122) [na:1.8.0_66-internal]
        at java.util.concurrent.FutureTask.get(FutureTask.java:192) [na:1.8.0_66-internal]
        at org.apache.cassandra.service.StorageService$4.runMayThrow(StorageService.java:2950) ~[apache-cassandra-2.1.8.jar:2.1.8]
        at org.apache.cassandra.utils.WrappedRunnable.run(WrappedRunnable.java:28) [apache-cassandra-2.1.8.jar:2.1.8]
        at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) [na:1.8.0_66-internal]
        at java.util.concurrent.FutureTask.run(FutureTask.java:266) [na:1.8.0_66-internal]
        at java.lang.Thread.run(Thread.java:745) [na:1.8.0_66-internal]
Caused by: java.lang.RuntimeException: org.apache.cassandra.exceptions.RepairException: [repair #b6b14860-5b93-11e5-a975-97f310c5c22c on local_group_wikipedia_T_parsoid_dataW4ULtxs1oMqJ/data, (-765749035276761727,-757983545592020469]] Validation failed in /10.64.32.178
        at com.google.common.base.Throwables.propagate(Throwables.java:160) ~[guava-16.0.jar:na]
        at org.apache.cassandra.utils.WrappedRunnable.run(WrappedRunnable.java:32) [apache-cassandra-2.1.8.jar:2.1.8]
        at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) [na:1.8.0_66-internal]
        at java.util.concurrent.FutureTask.run(FutureTask.java:266) [na:1.8.0_66-internal]
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) ~[na:1.8.0_66-internal]
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) ~[na:1.8.0_66-internal]
        ... 1 common frames omitted
Caused by: org.apache.cassandra.exceptions.RepairException: [repair #b6b14860-5b93-11e5-a975-97f310c5c22c on local_group_wikipedia_T_parsoid_dataW4ULtxs1oMqJ/data, (-765749035276761727,-757983545592020469]] Validation failed in /10.64.32.178
        at org.apache.cassandra.repair.RepairSession.validationComplete(RepairSession.java:166) ~[apache-cassandra-2.1.8.jar:2.1.8]
        at org.apache.cassandra.service.ActiveRepairService.handleMessage(ActiveRepairService.java:406) ~[apache-cassandra-2.1.8.jar:2.1.8]
        at org.apache.cassandra.repair.RepairMessageVerbHandler.doVerb(RepairMessageVerbHandler.java:134) ~[apache-cassandra-2.1.8.jar:2.1.8]
        at org.apache.cassandra.net.MessageDeliveryTask.run(MessageDeliveryTask.java:62) ~[apache-cassandra-2.1.8.jar:2.1.8]
        ... 3 common frames omitted

... though repairs continued for some time after, apparently completing successfully.

TL;DR I believe this failure was isolated to a single token range on this column family, and the repair completed successfully otherwise.

Eevans updated the task description. Sep 16 2015, 1:39 PM
Eevans updated the task description. Sep 16 2015, 1:41 PM
Eevans updated the task description. Sep 19 2015, 11:05 PM
Eevans updated the task description. Sep 19 2015, 11:10 PM
Eevans updated the task description. Sep 19 2015, 11:12 PM
Eevans updated the task description. Sep 21 2015, 2:02 PM
Eevans updated the task description.

As a result of our decision to hold off on a codfw rebuild until multi-instance is in place, I had planned to hold off on these repairs as well. However, I think we can continue by using the --dc eqiad parameter. Limiting partition and replica repairs to eqiad nodes won't do anything for the partition ranges now owned by codfw, but it will allow us to continue making some headway in the meantime.

@Eevans, where do you see this heading in the longer term? Do you think we can make incremental repairs work? Is it worth trying in staging?

> @Eevans, where do you see this heading in the longer term? Do you think we can make incremental repairs work? Is it worth trying in staging?

Longer term, invocations should be regular and automated, and yes, incremental repairs are definitely something I think we should look into.

Incremental repairs are a bigger step, though: once we head down that path we're altering the way compaction works. Compaction will maintain pools of repaired and unrepaired SSTables, and with LCS, the unrepaired pool is size-tiered. I think that, at the least, this will put some emphasis on the "regular" aspect (which in turn puts emphasis on the "automated"), if for no other reason than to keep that size-tiered pool relatively small.

The reason I haven't gone down that path yet is that a) these initial repairs are quite extensive due to the amount of time we've run without them, and b) we've been introducing (and for the coming weeks will continue to introduce) a lot of change elsewhere into the cluster. I did not want to add incremental repair to the list of variables at play (for either of those things).

On top of this, I wasn't really sure incremental repair made the cut in terms of priority.

TL;DR

  1. Catch up on the repair backlog with manual invocations of nodetool repair -pr
  2. Once a repair can complete reliably in minutes or hours, schedule them via cron, appropriately offset from one another (a sketch of such a cron target follows below).
    1. Optionally with -par, to speed things up
  3. Investigate a conversion to incremental repair
  4. Investigate more sophisticated repair command-and-control, vis-a-vis https://github.com/spotify/cassandra-reaper

Note: 3 and 4 could be done in parallel.
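
To make step 2 a little more concrete, here is a hedged sketch of what a per-node cron target might look like; only the `nodetool repair -pr -par` invocation comes from the plan above, while the wrapper itself and the log path are assumptions:

```python
#!/usr/bin/env python
# Hypothetical cron target, assumed to be installed on each node with a
# staggered schedule. Runs a parallel primary-range repair on the local
# node and logs the outcome; the log path is an assumption.
import logging
import subprocess
import time

logging.basicConfig(filename="/var/log/cassandra/repair-cron.log",
                    format="%(asctime)s %(message)s",
                    level=logging.INFO)

start = time.time()
logging.info("starting: nodetool repair -pr -par")
rc = subprocess.call(["nodetool", "repair", "-pr", "-par"])
logging.info("finished: exit code %d after %.0f seconds", rc, time.time() - start)
raise SystemExit(rc)
```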

Eevans updated the task description. Sep 25 2015, 2:06 PM
GWicke added a comment. Edited Sep 25 2015, 6:07 PM

@Eevans, thanks for the background.

Do you see finishing one full round of repairs in eqiad as a precondition for the move to the multi-instance setup?

> Once a repair can complete reliably in minutes or hours,

Do you see a chance for full repairs ever being this fast, given our dataset / instance sizes?

> Incremental repairs are a bigger step, though: once we head down that path we're altering the way compaction works. Compaction will maintain pools of repaired and unrepaired SSTables, and with LCS, the unrepaired pool is size-tiered.

That could actually work to our advantage by lowering write amplification, while also limiting the number of SSTables per read and guaranteeing timely compaction. But yeah, I agree that we'll need to thoroughly test it in staging. If that works okay, we could also consider testing it on a single production node for a while.

Eevans added a comment. Edited Sep 25 2015, 6:21 PM

> @Eevans, thanks for the background.

> Do you see finishing one full round of repairs in eqiad as a precondition for the move to the multi-instance setup?

No, not at all.

>> Once a repair can complete reliably in minutes or hours,

> Do you see a chance for full repairs ever being this fast, given our dataset / instance sizes?

I meant "minutes or hours" in the sense that the time required could be reasonably expressed in those units (as opposed to now, where the only reasonable unit is "days").

I'm also expecting instance sizes to start dropping dramatically as we roll out multi-instance.

>> Incremental repairs are a bigger step, though: once we head down that path we're altering the way compaction works. Compaction will maintain pools of repaired and unrepaired SSTables, and with LCS, the unrepaired pool is size-tiered.

> That could actually work to our advantage by lowering write amplification, while also limiting the number of SSTables per read and guaranteeing timely compaction. But yeah, I agree that we'll need to thoroughly test it in staging. If that works okay, we could also consider testing it on a single production node for a while.

I expect it to make SSTables per read higher than it is now, though.

GWicke added a comment. Edited Sep 25 2015, 7:33 PM

I meant "minutes or hours" in the sense that the time required could be reasonably expressed in those units (as opposed to now, where the only reasonable unit is "days").

Right, that was my understanding too. Do you expect small instances to significantly reduce the repair cost per $unit of data, to the point of repairing the equivalent data of one current instance in less than "days"?

> I expect it to make SSTables per read higher than it is now, though.

If we repair daily, then the depth of the unrepaired pool should be low-ish, though. Perhaps worth testing in staging? Running a full dump before & after and checking the sstables per read metric should give us an idea of what the impact would be.

I meant "minutes or hours" in the sense that the time required could be reasonably expressed in those units (as opposed to now, where the only reasonable unit is "days").

Right, that was my understanding too. Do you expect small instances to significantly reduce the repair cost per $unit of data, to the point of repairing the equivalent data of one current instance in less than "days"?

I don't know if it will reduce the repair cost per $unit, but it will certainly allow us to work through it with greater concurrency, and it will reduce the time it takes per instance (which makes scheduling them more tractable).

To be clear, the (vast) majority of the time spent during repair now is in anti-compaction, the result of having a lot of data to repair (a lot in terms of distribution, if not volume). There is no silver bullet for that other than getting the dataset into a better state of repair (which is the main purpose of these initial repairs).

As part of better understanding Cassandra's behaviour (e.g. in T126221), I think we should also resurrect this task and look into regular incremental repairs.

I've resumed repairs, starting with restbase1008:

root@restbase1008:~# nodetool-a repair -pr
[2016-02-24 14:54:47,903] Starting repair command #1, repairing 128 ranges for keyspace local_group_default_T_mobileapps_lead (parallelism=SEQUENTIAL, full=true)
[2016-02-24 14:55:47,913] Lost notification. You should check server log for repair status of keyspace local_group_default_T_mobileapps_lead
[2016-02-24 14:56:47,915] Lost notification. You should check server log for repair status of keyspace local_group_default_T_mobileapps_lead
[2016-02-24 14:57:47,916] Lost notification. You should check server log for repair status of keyspace local_group_default_T_mobileapps_lead
[2016-02-24 14:58:47,917] Lost notification. You should check server log for repair status of keyspace local_group_default_T_mobileapps_lead
[2016-02-24 14:59:47,917] Lost notification. You should check server log for repair status of keyspace local_group_default_T_mobileapps_lead
[2016-02-24 15:00:47,918] Lost notification. You should check server log for repair status of keyspace local_group_default_T_mobileapps_lead
[2016-02-24 15:01:47,920] Lost notification. You should check server log for repair status of keyspace local_group_default_T_mobileapps_lead
[2016-02-24 15:02:47,920] Lost notification. You should check server log for repair status of keyspace local_group_default_T_mobileapps_lead
[2016-02-24 15:03:47,922] Lost notification. You should check server log for repair status of keyspace local_group_default_T_mobileapps_lead
[2016-02-24 15:04:47,922] Lost notification. You should check server log for repair status of keyspace local_group_default_T_mobileapps_lead

The lost notification messages in the output seem harmless; I also can't find anything related to errors in system.log.

Mentioned in SAL [2016-02-25T04:35:22Z] <urandom> restarting restbase1008-a to cancel rebuild T108611 T119935

Mentioned in SAL [2016-02-26T20:04:37Z] <urandom> issuing test repair on cerium (restbase staging), keyspace : T108611

Mentioned in SAL [2016-02-27T21:19:26Z] <urandom> issuing test repair on cerium (restbase staging), wp_parsoid html keyspace, (-w 5) : T108611

I'd like to change things up a bit for subsequent repairs here. First though, a little background:

Repair in Cassandra works by constructing a Merkle tree of a node's data and sending it to replicas for comparison. Data streams corresponding to the differences between trees are then exchanged to correct any discrepancies.

Since the resolution of a Merkle tree is finite, some overstreaming in this second step should be expected (and some degree would be tolerable). However, the depth of a tree in Cassandra is fixed, so the larger the data, the worse the resolution becomes, and the more inefficient repair is in terms of both disk IO (validation compaction) and network IO (streaming). The solution is simple: repair less data per session; the smaller the session, the greater the Merkle tree resolution, and the more efficient repair becomes.

It turns out this is quite straightforward to do, as nodetool allows you to specify a sub-range using the -st and -et arguments.
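
To make the mechanism concrete, here is an illustrative sketch (not the script referenced below) that splits a token range into small, equal steps and repairs each step with -st/-et; the step count is arbitrary, the example range is borrowed from the error log earlier in this task, and a real run would derive the node's primary range from the ring:

```python
#!/usr/bin/env python
# Illustrative sketch of sub-range repair: split a token range into small,
# equal steps and repair each step with -st/-et, so every Merkle tree covers
# far less data. The values passed in below are placeholders.
import subprocess

def subrange_repair(start_token, end_token, steps, keyspace):
    span = end_token - start_token
    for i in range(steps):
        st = start_token + (span * i) // steps
        et = start_token + (span * (i + 1)) // steps
        subprocess.check_call([
            "nodetool", "repair",
            "-st", str(st), "-et", str(et),
            keyspace,
        ])

# Placeholder invocation; a real run would use the node's actual primary
# range (e.g. as reported by `nodetool ring`).
subrange_repair(-765749035276761727, -757983545592020469, 16,
                "local_group_wikipedia_T_parsoid_dataW4ULtxs1oMqJ")
```

The script referenced below does essentially this, and additionally figures out the node's primary range for itself.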

Better still, someone has already written a Python script which encapsulates this nicely. From the GitHub README:

> The script works by figuring out the primary range for the node that it's being executed on, and instead of running repair on the entire range, run the repair on only a smaller sub-range. When a repair is initiated on a sub-range Cassandra constructs a merkle tree only for the range specified, which in turn divides the much smaller range into 15 segments. If there is disagreement in any of the hash values then a much smaller portion of data needs to be transferred which lessens load on the system.

I have experimented with this script in staging and the results are quite promising.

So I propose to do the following:

The expansion in eqiad rack 'a' is now complete. When the in-progress cleanups are done (presumably by Monday), things will be quiescent there, and we'll have a reasonably clear lay of the land (at least insofar as sizing is concerned). Per-keyspace sub-range repairs can then commence. Concurrency of these repairs can be carefully and incrementally increased, with an eye toward establishing a baseline for repair that is not impacting.

Mentioned in SAL [2016-03-14T18:56:06Z] <urandom> Starting Cassandra repairs on restbase1007-a.eqiad.wmnet : T108611

Eevans renamed this task from "perform initial (manual) repair of Cassandra cluster" to "Perform initial (manual) repair of Cassandra cluster". Apr 29 2016, 8:28 PM
Eevans added a project: Cassandra.
Eevans closed this task as Resolved. Sep 20 2016, 9:00 PM