Our current extract-transform-load (ETL) logic will download $CURRENT and $PREVIOUS data files, diff them, and then figure out whether to INSERT, UPDATE, or DELETE items from the actor_data table (and related tables). It assumes that the database matches the state of the $PREVIOUS data file.
From the perspective of users querying ipoid via IPInfo (see T339279: [Epic] Display data from ipoid in IPInfo infobox), there are at least two issues when a user investigates an IP address associated with an edit that was made some time ago, e.g. 7 days:
- the IP address may no longer be present in $CURRENT data file, so it will be dropped from ipoid's database. That means IPInfo would not be able to show anything useful.
- the metadata associated with the IP address may have changed in meaningful ways between the initial data and what ipoid's current data shows. The user should probably be able to see both the original metadata at time of edit, and the current metadata.
- It's unclear how likely this scenario is, i.e. that the metadata changes so substantially that a user reviewing the data would make a different judgment based on the difference between metadata from e.g. 7 days earlier and today.
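For reference, the diff step the current ETL performs can be sketched roughly as follows. This is an illustrative model, not ipoid's actual code: it treats the $PREVIOUS and $CURRENT files as maps from IP to metadata and classifies each IP into inserts, updates, and deletes.

```python
def diff_feeds(previous: dict, current: dict):
    """Classify IPs by comparing $PREVIOUS and $CURRENT data.

    Hypothetical helper: each argument is an {ip: metadata} map.
    """
    inserts = {ip: meta for ip, meta in current.items() if ip not in previous}
    deletes = {ip: meta for ip, meta in previous.items() if ip not in current}
    updates = {
        ip: meta
        for ip, meta in current.items()
        if ip in previous and previous[ip] != meta
    }
    return inserts, updates, deletes
```

An IP that drops out of $CURRENT lands in `deletes`, which is exactly the case where today's logic removes it from actor_data entirely.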
Soft delete
One idea to address these use cases is to introduce a deleted column to the actor_data table. Then, instead of issuing DELETE queries, we would set deleted = 1 on the actor_data row. We'd probably want a script that cleans up rows with deleted = 1 that are older than N (30?) days, to avoid using too much storage.
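A minimal sketch of the soft-delete idea, using an in-memory SQLite table with an illustrative schema (the real actor_data table has more columns, and the column/function names here are assumptions):

```python
import sqlite3
import time

RETENTION_SECONDS = 30 * 24 * 3600  # hypothetical N = 30-day window

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE actor_data (
        ip TEXT PRIMARY KEY,
        data TEXT,
        deleted INTEGER NOT NULL DEFAULT 0,
        deleted_at INTEGER  -- set when the row is soft-deleted
    )
""")
conn.execute("INSERT INTO actor_data (ip, data) VALUES ('1.2.3.4', '{}')")

def soft_delete(ip: str) -> None:
    # Instead of DELETE, flag the row and record when it was flagged.
    conn.execute(
        "UPDATE actor_data SET deleted = 1, deleted_at = ? WHERE ip = ?",
        (int(time.time()), ip),
    )

def purge_old(now: int) -> None:
    # The periodic cleanup script: hard-delete rows flagged > N days ago.
    conn.execute(
        "DELETE FROM actor_data WHERE deleted = 1 AND deleted_at < ?",
        (now - RETENTION_SECONDS,),
    )
```

Queries serving current data would then need a `WHERE deleted = 0` filter, while a history-aware endpoint could deliberately include flagged rows.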
Archive table
However, that wouldn't solve the second use case, where the metadata changes. To handle that use case, we could keep the existing ETL logic in place as it is currently, but when we detect that a row in actor_data would be updated or deleted, insert a row into an archive table that contains columns for an auto-incremented ID, IP (string), data (JSON blob), timestamp (when the item was added to the table), and is_deleted. A given IP could have multiple rows in the table. We'd populate the archived row from the contents of the corresponding line in the $PREVIOUS file. We could have main.sh also handle cleanup of rows in the archive table that are older than N (30?) days.
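A sketch of what that archive table and the write path could look like. The columns mirror the ones listed above; the table and function names are otherwise assumptions, and SQLite stands in for the real database:

```python
import json
import sqlite3
import time

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE actor_data_archive (
        id INTEGER PRIMARY KEY AUTOINCREMENT,
        ip TEXT NOT NULL,
        data TEXT NOT NULL,          -- JSON blob from the $PREVIOUS file
        timestamp INTEGER NOT NULL,  -- when the row was archived
        is_deleted INTEGER NOT NULL  -- 1 if the IP was deleted, 0 if updated
    )
""")

def archive_previous(ip: str, previous_meta: dict, was_deleted: bool) -> None:
    # Called just before the ETL updates or deletes an actor_data row:
    # the $PREVIOUS version of the metadata is preserved here.
    conn.execute(
        "INSERT INTO actor_data_archive (ip, data, timestamp, is_deleted) "
        "VALUES (?, ?, ?, ?)",
        (ip, json.dumps(previous_meta), int(time.time()), int(was_deleted)),
    )
```

Because each update or deletion appends a new row, one IP naturally accumulates multiple archived versions until the N-day cleanup removes the old ones.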
When querying the service, the endpoint would check for both 1) current data (as it does presently) and 2) historical data in the archive table, and return both. It would then be up to the caller (IPInfo, for example) to decide how to present this information to the end user, including any diff of the metadata.
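The lookup the endpoint would perform might look like this sketch, which combines a current row (if any) with all archived versions of the same IP. Schema and names are assumptions carried over from the sketches above:

```python
import json
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE actor_data (ip TEXT PRIMARY KEY, data TEXT)")
conn.execute("""
    CREATE TABLE actor_data_archive (
        id INTEGER PRIMARY KEY AUTOINCREMENT,
        ip TEXT, data TEXT, timestamp INTEGER, is_deleted INTEGER
    )
""")

def lookup(ip: str) -> dict:
    # 1) current data, as the endpoint returns today
    cur = conn.execute(
        "SELECT data FROM actor_data WHERE ip = ?", (ip,)
    ).fetchone()
    # 2) historical data, newest first, for the caller to diff/present
    history = conn.execute(
        "SELECT data, timestamp, is_deleted FROM actor_data_archive "
        "WHERE ip = ? ORDER BY timestamp DESC",
        (ip,),
    ).fetchall()
    return {
        "current": json.loads(cur[0]) if cur else None,
        "history": [
            {"data": json.loads(d), "timestamp": t, "is_deleted": bool(x)}
            for d, t, x in history
        ],
    }
```

An IP that dropped out of $CURRENT would return `current: None` with a non-empty history, which is exactly the first use case above.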
The archive table would probably add some tens of millions of rows to the database. We'd have to think about how many days of historical data we want to preserve. And we'd need to observe in practice how quickly it grows, because we don't really have a great handle yet on how often IPs are updated, drop out of the dataset, reappear, etc.
Of the two approaches (soft delete vs. archive table), setting up an archive table seems more straightforward, and it also handles both use cases.