
Allow querying historical information on IPs
Open, Needs TriagePublic

Description

Our current extract-transform-load (ETL) logic will download $CURRENT and $PREVIOUS data files, diff them, and then figure out whether to INSERT, UPDATE, or DELETE items from the actor_data table (and related tables). It assumes that the database matches the state of the $PREVIOUS data file.
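The diff step described above can be sketched as follows. This is an illustrative outline, not ipoid's actual implementation: it assumes the data files contain one JSON object per line, keyed by an `ip` field, and the function name is hypothetical.

```python
import json

def diff_files(previous_path, current_path):
    """Compare $PREVIOUS and $CURRENT data files (one JSON object per
    line, keyed by IP) and return the rows to INSERT, UPDATE, or DELETE."""
    def load(path):
        with open(path) as f:
            return {row["ip"]: row
                    for row in (json.loads(line) for line in f if line.strip())}

    prev, curr = load(previous_path), load(current_path)
    # New IPs appear only in $CURRENT; dropped IPs only in $PREVIOUS;
    # updates are IPs present in both whose metadata changed.
    inserts = {ip: row for ip, row in curr.items() if ip not in prev}
    deletes = {ip: row for ip, row in prev.items() if ip not in curr}
    updates = {ip: row for ip, row in curr.items()
               if ip in prev and row != prev[ip]}
    return inserts, updates, deletes
```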

From the perspective of users querying ipoid (see T339279: [Epic] Display data from ipoid in IPInfo infobox) via IPInfo, there are at least two issues when a user investigates an IP address associated with an edit that was made, e.g., 7 days ago:

  1. the IP address may no longer be present in the $CURRENT data file, so it will be dropped from ipoid's database. That means IPInfo would not be able to show anything useful.
  2. the metadata associated with the IP address may have changed in meaningful ways between the initial data and what ipoid's current data shows. The user should probably be able to see both the original metadata at time of edit, and the current metadata.
    1. It's unclear how likely this scenario is, i.e. that the metadata changes so substantially that a user reviewing the data would make a different judgment based on the difference between metadata from, e.g., 7 days earlier and today.

Soft delete

One idea to address the first use case is to introduce a deleted column to the actor_data table. We would then stop issuing DELETE queries and instead set deleted = 1 on the affected actor_data row. We'd probably want a script that cleans up rows with deleted = 1 that are older than N (30?) days, to avoid using too much storage.

Archive table

However, that wouldn't solve the second use case, where the metadata changes. To handle that use case, we could keep the existing ETL logic in place as-is, but when we detect that a row in actor_data would be updated or deleted, we would insert a row into an archive table with columns for an auto-incremented ID, IP (string), data (JSON blob), timestamp (when the item was added to the archive), and is_deleted. A given IP could have multiple rows in the table. We'd populate the archived row from the contents of the line we're examining in the $PREVIOUS file. We could have main.sh also handle cleanup of archive rows older than N (30?) days.
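A sketch of the archive-table idea, again using SQLite for illustration; the table name, schema details, and helper names are assumptions, not ipoid's actual code:

```python
import json
import sqlite3
import time

conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE actor_data_archive (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    ip TEXT NOT NULL,              -- not unique: one IP may have many rows
    data TEXT NOT NULL,            -- JSON blob of the previous metadata
    archived_at INTEGER NOT NULL,  -- when the row entered the archive
    is_deleted INTEGER NOT NULL)""")  # 1 if the IP dropped out of $CURRENT

def archive_previous(prev_row, was_deleted):
    """Before applying an UPDATE or DELETE to actor_data, copy the row's
    previous state (from the $PREVIOUS file) into the archive."""
    conn.execute(
        "INSERT INTO actor_data_archive (ip, data, archived_at, is_deleted) "
        "VALUES (?, ?, ?, ?)",
        (prev_row["ip"], json.dumps(prev_row), int(time.time()),
         1 if was_deleted else 0))

def prune_archive(n_days=30):
    """Drop archive rows older than n_days to bound table growth."""
    cutoff = int(time.time()) - n_days * 86400
    conn.execute(
        "DELETE FROM actor_data_archive WHERE archived_at < ?", (cutoff,))
```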

When querying the service, the endpoint would check both 1) the current data (as it does presently) and 2) historical data in the archive table, and return a set of info for both. It would then be up to the caller (IPInfo, for example) to decide how to present this information to the end user, including any diff of the metadata.
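The combined response might look roughly like the following. This is a hypothetical shape, not ipoid's actual API; the point is that both current and archived data are returned and presentation (including diffing) is left to the caller:

```python
def lookup(ip, current_store, archive_rows):
    """Return both the current row (if any) and archived history for an IP.

    current_store: mapping of IP -> current metadata (None-able lookup)
    archive_rows:  rows already fetched from the archive table for this IP
    """
    return {
        "ip": ip,
        # None if the IP has dropped out of the $CURRENT dataset
        "current": current_store.get(ip),
        "history": [
            {
                "data": row["data"],
                "archived_at": row["archived_at"],
                "is_deleted": bool(row["is_deleted"]),
            }
            for row in archive_rows
        ],
    }
```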

The archive table would probably add some tens of millions of rows to the database. We'd have to think about how many days of historical data we want to preserve, and we'd need to observe in practice how quickly it grows, because we don't yet have a good handle on how often IPs are updated, drop out of the dataset, reappear, etc.

Of the two approaches (soft delete vs. archive table), setting up an archive table seems more straightforward, and it handles both use cases.

Event Timeline

@Niharika @Madalina this task came out of discussions with @STran and @Tchanders. Could you please comment from the product side and confirm the assumptions about the need to support the use case of users accessing IP Info for an older edit (where "older" means more than 1 day) whose IP no longer exists in the Spur database? One scenario is:

  1. Day 1: User makes a vandalism edit from an IP that is known to Spur to be associated with a residential proxy service
  2. Day 2: the IP address drops out of Spur's database, because it is no longer associated with a residential proxy service
  3. Day 3: A CheckUser user inspects the IP address, and doesn't see any information that associates the IP with a residential proxy service

The proposals in this task would ensure that in step 3 above, the CheckUser sees the IP information associated with the edit at the time of the edit, not as of the current day. (cc @KColeman-WMF as well about the design implications of that for IP Info.)

I expect querying older IPs that no longer exist in the Spur database would be a valid use case. I don't have a good sense of how important it is for end users (do they run into these IPs once a month? a few times a day?). I suggest we wait on user feedback before prioritizing this feature enhancement. We can make this an explicit focus when we gather user feedback after Spur is integrated into IP Info.

@kostajh and @Niharika I agree we should ask about this when we gather feedback after Spur is integrated. In the meantime, I've attached mockups for two options:

Option 1:

We don’t hold the data longer than 1 day.
We tell the user that Spur information is unavailable.

archival-data-option-1.png (552×932 px, 46 KB)

Option 2:

We hold data for longer (e.g. 7 days).
We tell the user that this information was correct 7 days ago but is no longer valid.

archival-data-option-2.png (552×932 px, 44 KB)