Page MenuHomePhabricator

Remove old data from instanceof_cache and title_cache in image_suggestions.suggestions in Cassandra
Closed, DeclinedPublic

Description

Before a ttl was set on the instanceof_cache and title_cache tables in Cassandra, plenty of old data was written, and that data persists in the tables

Most of this can be cleaned up in a similar manner to how the suggestions data was cleaned up in https://phabricator.wikimedia.org/T317364#8400180 - i.e.

  • gather all wiki/page_id/rev_page combinations for all suggestions in image_suggestions_suggestions in Hive for all snapshots
  • left_anti join them with all wiki/page_id/rev_page combinations in image_suggestions_suggestions in Hive from the latest snapshot
  • write the results to a csv
  • use a python script to read the csv one row at a time and delete from Cassandra

Event Timeline

Per T317364#8415200, let's hold off until we're sure this data can't be used to suss out T317364

I think that I've gotten what I need from instanceof_cache & title_cache, they can be cleaned up.

@Cparle how do you plan to do this BTW? Unlike suggestions that supports historical results, these tables only have a single entry. Any time a preexisting record has been overwritten, it will have a TTL (any that have been overwritten since we instated the TTL anyway); The records you'd want to clean up are those written without a TTL, that haven't been (over)written since. Is this something you can suss out from Hive? Is there anyway that we can just TRUNCATE these tables?

Yeah we should be able to suss it out from Hive I think ... I don't think we can truncate because Growth are using the data all the time

MarkTraceur subscribed.

Declining as unnecessary per discussion during our estimation meeting.