
DELETE mechanism for Cassandra Analytics datasets
Open, Needs Triage, Public

Description

In T364583: Consider what mechanism to use to make date deletion possible, we concluded that we should build a mechanism to DELETE data from Cassandra instead of setting TTLs.

Such a mechanism will benefit the Pageviews tables, as well as the Commons Impact Metrics tables. It could help other analytics assets too.

Some notes:

For Commons Impact Metrics, the source data is available in the data lake, so we could build point-query DELETEs for each row that we want gone: run the HQL associated with the transformation (say, load_cassandra_commons_top_editors_monthly.hql), then feed the columns that define the Cassandra PRIMARY KEY to the Spark-Cassandra connector's deleteFromCassandra() mechanism. Presumably these point DELETEs would take about as long as the INSERTs do: no more than a couple of minutes.
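The point-DELETE idea above can be sketched as follows. This is a minimal illustration in plain Python, not the Spark job itself: it only shows how the primary-key columns of the HQL output would map onto one parameterized CQL DELETE per row. The keyspace, table, and column names are hypothetical placeholders, and the in-memory `rows` list stands in for the query result.

```python
from typing import Dict, Iterable, Iterator, Sequence, Tuple


def build_point_delete(keyspace: str, table: str, primary_key: Sequence[str]) -> str:
    """Build a parameterized CQL DELETE that targets exactly one row.

    A point delete restricts every primary-key column (partition key
    and clustering columns) with an equality.
    """
    where = " AND ".join(f"{column} = ?" for column in primary_key)
    return f"DELETE FROM {keyspace}.{table} WHERE {where}"


def key_tuples(
    rows: Iterable[Dict[str, object]], primary_key: Sequence[str]
) -> Iterator[Tuple[object, ...]]:
    """Project each row onto its primary-key columns, in key order."""
    for row in rows:
        yield tuple(row[column] for column in primary_key)


# Hypothetical primary key; the real table definition may differ.
PRIMARY_KEY = ("wiki", "year", "month", "rank")

# Stand-in for the rows produced by the HQL transformation.
rows = [
    {"wiki": "commons", "year": 2019, "month": 5, "rank": 1, "edit_count": 120},
    {"wiki": "commons", "year": 2019, "month": 5, "rank": 2, "edit_count": 95},
]

statement = build_point_delete(
    "example_keyspace", "example_top_editors_monthly", PRIMARY_KEY
)
bind_values = list(key_tuples(rows, PRIMARY_KEY))
```

Each tuple in `bind_values` would be bound to `statement` and executed once, so the number of DELETEs equals the number of rows selected. In the Spark job, the connector's deleteFromCassandra() would take the place of this manual statement building.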

For Pageviews, Eric notes that the schema itself could allow range deletes:

...
The way the schema is right now, the dataset can grow unbounded, yes, but other than storage resources used, it presents no concern. If we changed it, and moved year and month out of the partition key, then partitions will grow unbounded. It would probably take many years (many more than 5), but on a long enough timeline, that will eventually cause problems. So my (relatively minor) concern would be kicking the can so far down the road that nothing is ever done, and that one day our successors look back on us with contempt. 😀 In other words, we should probably make reasonable efforts to make sure that —if we do this— we make some reasonable effort to implement the deletes. I think we have some more immediate motivation to do so too, @JAllemandou has indicated a desire to cull records from pageviews, and the schema there already permits this. Anything we created could (nay should) be generic enough to accommodate all of these.

So at first glance it seems it would be easier to build the DELETE mechanisms separately.

In this task, we should:

  • Figure out if it makes sense to build a generic mechanism to tackle DELETEs
    • If not, then build individual DELETE mechanisms
    • If yes, then build such a mechanism.
  • Test this mechanism against the Cassandra Staging environment
  • Once we are happy, deploy to prod, likely via Airflow DAG

Event Timeline

Just for posterity's sake:

Some of the Commons Impact Metrics tables already accommodate doing range deletes. Those that do not could be made to do so by moving the year and month attributes into the composite key. This would make for fewer, wider partitions, but not unacceptably so (at least based on the example discussed in T364583). Using range deletes is cheap, because you're writing a single tombstone to remove an entire range of values. As far as I can tell, this change to the schema could be made without having to update any of the affected code (i.e. it changes nothing, except for how Cassandra organizes the data). It is not a change that can be made once data is in place, though. In other words, now seems like an ideal time; doing this later would require a migration.
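As a rough sketch of what a range delete could look like: assuming `year` and `month` are the first two clustering columns, in that order, everything older than a cutoff month in one partition can be removed with two statements. The keyspace, table, and partition-key names below are hypothetical, and the code only builds the CQL text and bind values; it does not talk to a cluster.

```python
from typing import List, Sequence, Tuple


def build_range_deletes(
    keyspace: str, table: str, partition_key: Sequence[str]
) -> List[str]:
    """Build two CQL range deletes covering all rows before a cutoff month.

    Assumes (hypothetically) that `year` and `month` are the first two
    clustering columns of the table, in that order.
    """
    partition = " AND ".join(f"{column} = ?" for column in partition_key)
    target = f"{keyspace}.{table}"
    return [
        # Every row from a year strictly before the cutoff year.
        f"DELETE FROM {target} WHERE {partition} AND year < ?",
        # Rows in the cutoff year, from months strictly before the cutoff month.
        f"DELETE FROM {target} WHERE {partition} AND year = ? AND month < ?",
    ]


def range_delete_params(
    partition_values: Sequence[object], cutoff_year: int, cutoff_month: int
) -> List[Tuple[object, ...]]:
    """Bind values for the two statements, in the same order."""
    return [
        (*partition_values, cutoff_year),
        (*partition_values, cutoff_year, cutoff_month),
    ]


# Hypothetical partition key and cutoff: drop everything before June 2019.
statements = build_range_deletes("example_keyspace", "example_table", ["wiki"])
params = range_delete_params(["commons"], 2019, 6)
```

Each statement writes one range tombstone for the partition it targets, regardless of how many rows fall inside the range.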

Point deletes will work too, though I suspect the tooling will end up being more bespoke than something range-based. It's also less efficient and less elegant: it doubles the number of transactions (however many rows are inserted in a day is the number of deletes needed to remove everything 5 years + 1 day old).

It is not a change that can be made once data is in place, though. In other words, now seems like an ideal time; doing this later would require a migration.

Fair enough, reopened T364583 to discuss that issue.