In T364583: Consider what mechanism to use to make date deletion possible, we concluded that we should build a mechanism to DELETE data from Cassandra instead of setting TTLs.
Such a mechanism will benefit the Pageviews tables, as well as the Commons Impact Metrics tables. It could help other analytics assets too.
Some notes:
For Commons Impact Metrics, the source data is available in the datalake, so we could build point-query DELETEs for each row that we want gone: run the HQL associated with the transformation (say, load_cassandra_commons_top_editors_monthly.hql), then feed the columns that define the Cassandra PRIMARY KEY to the Spark-Cassandra connector deleteFromCassandra() mechanism. Presumably these point DELETEs would take about as long as the INSERTs do: no more than a couple of minutes.
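To make the point-DELETE idea concrete, here is a minimal sketch in plain Python of what "one DELETE per PRIMARY KEY" amounts to in CQL. It is not the Spark-Cassandra connector path mentioned above; it only shows the statement shape and the key extraction. The keyspace, table, and column names are hypothetical placeholders, not the real Commons Impact Metrics schema, and the rows stand in for the output of the HQL transformation.

```python
from typing import Iterable, List, Mapping, Sequence, Tuple


def build_point_delete(keyspace: str, table: str, primary_key: Sequence[str]) -> str:
    """Return a parameterized CQL DELETE that targets exactly one row.

    Identifiers are assumed to be trusted, lowercase names; only the
    values are left as bind markers (?).
    """
    if not primary_key:
        raise ValueError("primary_key must name at least one column")
    where = " AND ".join(f"{col} = ?" for col in primary_key)
    return f"DELETE FROM {keyspace}.{table} WHERE {where}"


def key_tuples(rows: Iterable[Mapping[str, object]],
               primary_key: Sequence[str]) -> List[Tuple[object, ...]]:
    """Keep only the PRIMARY KEY columns of each row, in key order."""
    return [tuple(row[col] for col in primary_key) for row in rows]


# Hypothetical key columns and rows, standing in for the HQL output.
PRIMARY_KEY = ("category", "year", "month")
rows = [
    {"category": "Example_A", "year": "2024", "month": "01", "edit_count": 12},
    {"category": "Example_B", "year": "2024", "month": "01", "edit_count": 7},
]

statement = build_point_delete("hypothetical_keyspace", "hypothetical_table", PRIMARY_KEY)
params = key_tuples(rows, PRIMARY_KEY)

print(statement)
for p in params:
    print(p)
```

Non-key columns such as edit_count are dropped, since a point DELETE only needs the full PRIMARY KEY of each row.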
For Pageviews, Eric notes that the schema itself could allow range deletes:
So at first glance, it seems it would be easier to build the DELETE mechanisms separately.
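For contrast with point DELETEs, a range DELETE covers many rows with one statement. The sketch below is illustrative only: it assumes the range column is a clustering column and that the full partition key is restricted by equality, which is what CQL requires for range deletes. All names are hypothetical and do not reflect the actual Pageviews schema.

```python
from typing import Sequence


def build_range_delete(keyspace: str, table: str,
                       partition_key: Sequence[str], range_column: str) -> str:
    """Return a parameterized CQL DELETE over a half-open range.

    The partition key columns are matched by equality; the (assumed)
    clustering column is bounded as lower <= value < upper.
    """
    if not partition_key:
        raise ValueError("partition_key must name at least one column")
    conditions = [f"{col} = ?" for col in partition_key]
    conditions.append(f"{range_column} >= ?")
    conditions.append(f"{range_column} < ?")
    return f"DELETE FROM {keyspace}.{table} WHERE " + " AND ".join(conditions)


# Hypothetical partition key and clustering column.
range_statement = build_range_delete(
    "hypothetical_keyspace",
    "hypothetical_table",
    partition_key=("project", "granularity"),
    range_column="ts",
)
print(range_statement)
```

Here one statement per partition replaces one statement per row, which is why the two cases may call for different mechanisms.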
In this task, we should:
- Figure out if it makes sense to build a generic mechanism to tackle DELETEs
- If not, then build individual DELETE mechanisms
- If yes, then build such a mechanism.
- Test this mechanism against the Cassandra Staging environment
- Once we are happy, deploy to prod, likely via an Airflow DAG