
Set up regular repairs for AQS Cassandra cluster tables
Open, Low, Public

Description

The migration of AQS data from cassandra2 to cassandra3 showed us some small inconsistencies in the data handled by cassandra2. This was expected, as we don't have regular repair of our tables in place. It nonetheless led to complications in data movement: we had to run repairs before taking snapshots of the tables, and for the biggest one we took a different strategy, as we expected the repair to be extra long.
This task is about setting up a regular and possibly incremental repair process for AQS tables, so that we can have a fully consistent state on the cluster as needed.

Event Timeline

Repair is such a... complex subject. So much so that I'm not sure how to do justice to all of the considerations in a phab comment. :/

Let me try this first as Devil's Advocate:

  • Repair is really expensive, and that is across every dimension (cpu, network & storage io)
    • What impact will this have?
      • On cluster utilization?
      • On performance?
    • ¯\_(ツ)_/¯
  • Full repairs of these datasets are almost certainly intractable
  • Incremental repairs create separate pools of repaired v. unrepaired files, each of which needs its own compaction pipeline (see the invocation sketch just after this list)
    • What impact will this have?
      • On cluster utilization?
      • On performance?
    • ¯\_(ツ)_/¯
  • Cluster-wide repair requires sufficient coordination and tracking so as to require its own software infrastructure (Reaper, for example)
  • With 3 replicas + QUORUM writes (and hinted hand-off enabled), actual inconsistencies will be very rare
    • With QUORUM reads (and read-repair enabled), the possibility of any client seeing these inconsistencies is even rarer (basically non-existent? With RF=3, a QUORUM read and a QUORUM write always overlap on at least one replica, since 2 + 2 > 3)
      • Given the nature of this dataset, any inconsistency would result in an absence of data, as opposed to incorrect data
  • The cluster migration scenario is an exceptional occurrence
    • And one we should try to avoid ever doing again...
    • And can be dealt with by importing from all replicas (this is what we did, yes?)
  • This is secondary data; Results can always be regenerated
    • Question: Is this true? I've wondered for a while now whether we have (or are ever planning to have) AQS retention that exceeds that of the underlying data.
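
To make the full v. incremental distinction concrete, here is a minimal sketch of the per-table invocations (the nodetool flags are real; the keyspace name "aqs" is a placeholder, and the table names are the ones discussed later in this task):

```
# A minimal sketch of per-table repair via nodetool (real flags, but the
# keyspace name "aqs" is a placeholder for illustration).
import subprocess

def repair_table(keyspace: str, table: str, full: bool = False) -> None:
    cmd = ["nodetool", "repair", "-pr"]  # -pr: this node's primary ranges only
    if full:
        cmd.append("--full")  # anti-entropy across *all* SSTables
    # Without --full, Cassandra (>= 2.2) runs an incremental repair: SSTables
    # already marked repaired are skipped next time, which is exactly what
    # splits data into the repaired/unrepaired pools mentioned above.
    subprocess.run(cmd + [keyspace, table], check=True)

repair_table("aqs", "mediarequest_per_file")                 # incremental
repair_table("aqs", "pageview_per_article_flat", full=True)  # full
```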

TL;DR From a cost:benefit perspective, is this worth it?

The cluster migration took about 2-3 months, and while little of that time was spent in active maintenance/work, it was a long-lasting background task that took up cycles and a fair amount of context switching. With Cassandra 4 already out, an upgrade is likely in the near future. I would really like us to avoid having to go through the motions again.

As we decide on whether to switch on repair, here are some additional questions and considerations:

  • What was the nature and percentage of missing data?
  • Given the kind of tables that we have, would this be an acceptable level of data loss?
  • What would recreating the data entail? Could that be even more costly than keeping the multiple copies?
  • For some of the newly generated dataset use cases, is potential future data loss OK?
  • Can repair be switched on for individual tables?
  • If we are not sure what the performance cost of running repair would be, now may be a good time to give it a try before we completely migrate to the new cluster. However, if data loss is minimal and acceptable, we can define an SLA and leave things as they are.
odimitrijevic moved this task from Incoming (new tickets) to Serve on the Data-Engineering board.

Marking as High given the timeliness of the decision as we switch to the new cluster.

@JAllemandou can you please share info (or a link) on the data loss observed?

Adding my views on the questions from Olja (reordering questions as well):

  • What was the nature and percentage of missing data?

After loading data from 4 instances (a rack), we were missing some rows on almost all tables. The number of missing rows was small/very small in proportion to the total number of rows (I didn't compute ratios), but was mostly concentrated on some days (I assume those were moments when the cluster experienced issues).

  • What would recreating the data entail? Could that be even more costly than keeping the multiple copies?

We keep on the cluster the original data for everything loaded into Cassandra. The concern about reloading is that we wouldn't know what to reload, and that a full reload would be expensive for the big tables (repair isn't really a problem for the small tables either; see the last question below).

  • Given the kind of tables that we have, would this be an acceptable level of data loss?

This is a complicated question. The data we store in AQS is analytics data, therefore not "primary wiki" data. In that regard, we could easily say that it's less important and that data loss is OK (we've already done so). Nonetheless, knowing that we have the data and the systems, doing our best to make the data available as we have it is what I'd expect from us.

  • For some of the newly generated dataset use cases, is potential future data loss OK?

AFAIK, the generated-dataset use cases involve reloading/updating data, while our use case is incremental. This is different in that data on the generated-dataset side doesn't grow that much (but changes), and can more easily be reloaded.

  • Can repair be switched on for individual tables?

Giving my view on this even if Eric's is probably more precise: repair is meant to be done on individual tables. Having regular/automated/incremental repairs would require automation, in which we should be able to parameterize which tables to process (a rough sketch follows below).
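
A rough sketch of that parameterization, assuming a simple wrapper around nodetool (the keyspace name and table list are placeholders for whatever configuration the automation would read):

```
# Sketch: repair automation parameterized on which tables to process.
import subprocess
import sys

KEYSPACE = "aqs"  # placeholder
TABLES = ["mediarequest_per_file", "pageview_per_article_flat"]  # placeholder

failed = []
for table in TABLES:
    # One table per invocation, so a failure on one table doesn't block the
    # rest; -pr restricts work to this node's primary token ranges, so running
    # the same loop on every node covers each range exactly once.
    result = subprocess.run(["nodetool", "repair", "-pr", KEYSPACE, table])
    if result.returncode != 0:
        failed.append(table)

if failed:
    sys.exit("repair failed for: " + ", ".join(failed))
```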

  • If we are not sure what the performance cost of running repair would be, now may be a good time to give it a try before we completely migrate to the new cluster. However, if data loss is minimal and acceptable, we can define an SLA and leave things as they are.

Ben repaired all tables except the biggest, pageview_per_article_flat, when we realized there was data missing with the rack-snapshot approach. IIRC, the repair operation on 4 instances (the rack to snapshot) took a relatively small amount of time for most of the tables, as they don't contain a lot of data (up to hours). It then took a few days to repair the second-biggest table, mediarequest_per_file. Finally, Ben started the repair process on the biggest table, but we decided to give up and use the all-instances-snapshot option after the repair of a single instance had reported only ~10% done (@BTullis can you confirm that please? I don't trust my memory :).

In the end, the discussion is not really about repairs for the small tables; those are cheap enough (and we're not even talking about incremental). The concern is the big tables, for which incremental repair would probably be beneficial.
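
Whether or not we go incremental, one standard way to keep repair of a very large table tractable is subrange repair: splitting the token ring into slices so each nodetool invocation does a bounded amount of work. A sketch, assuming the default Murmur3Partitioner token range (keyspace name and slice count are illustrative):

```
# Sketch: subrange repair of one large table over the Murmur3 token ring.
import subprocess

MIN_TOKEN = -2**63      # Murmur3Partitioner ring bounds
MAX_TOKEN = 2**63 - 1

def subrange_repair(keyspace: str, table: str, slices: int = 256) -> None:
    step = (MAX_TOKEN - MIN_TOKEN) // slices
    for i in range(slices):
        start = MIN_TOKEN + i * step
        # make sure the final slice reaches the end of the ring
        end = MAX_TOKEN if i == slices - 1 else start + step
        subprocess.run(
            ["nodetool", "repair", "-st", str(start), "-et", str(end),
             keyspace, table],
            check=True,
        )

subrange_repair("aqs", "pageview_per_article_flat")  # placeholder keyspace
```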

@JAllemandou and @BTullis, what are the next steps on this work? Is this still valid? (The task is quite old.)

The task is old but the objective is still valid IMO.
We should talk to @Eevans about this.

Eevans lowered the priority of this task from High to Low.Apr 16 2024, 3:19 PM
Eevans added a project: Cassandra.
Eevans added a subscriber: KOfori.

We've made the upgrade to 4.x already, and we did so without a migration. If I've understood the context above, that was the reason for elevating the priority, so I'm going to drop it down now. Please feel free to readjust if that's wrong.


As for the work itself: I continue to question whether or not it's worth it in this case (I'm not saying that it is not, only that it warrants a discussion). Implementing this will be a relatively heavy lift. It requires infrastructure for monitoring and coordination, and we'd need to move slowly and test carefully to ensure the workloads behave as expected.

That said, having the capability to schedule repairs would almost certainly be a requirement for some primary data workloads, and not having this capability might discourage those from ever existing. As much as I hate to say it, sometimes you have to build it before they come. :)
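
To make the coordination and monitoring point concrete, even a minimal scheduling layer has to track one invariant: every table must complete a repair cycle within gc_grace_seconds (864000 s, i.e. 10 days, by default), or tombstones can expire unrepaired and deleted data can resurrect. A toy sketch of that bookkeeping (tools like Reaper do this properly; the in-memory state here is a placeholder):

```
# Toy sketch of the core invariant a repair scheduler must track: each
# table's last successful repair must stay younger than gc_grace_seconds,
# otherwise tombstones may be collected before being replicated everywhere
# and deleted data can reappear.
import time
from typing import Optional

GC_GRACE_SECONDS = 864_000  # Cassandra default gc_grace_seconds (10 days)
SAFETY_MARGIN = 0.5         # aim to re-repair each table at half that age

# table -> unix timestamp of last successful repair; a real coordinator
# would persist this and update it from actual repair results (placeholder).
last_repaired = {}

def tables_due(tables, now: Optional[float] = None):
    """Return the tables whose last repair is older than our deadline."""
    now = time.time() if now is None else now
    deadline = GC_GRACE_SECONDS * SAFETY_MARGIN
    return [t for t in tables if now - last_repaired.get(t, 0.0) > deadline]
```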