Implement batch tracking for imports
Closed, Resolved (Public)

Assigned To

Authored By

	STran
	Sep 6 2023, 2:01 AM

Description

In T325630#9136298, @kostajh wrote:

In T325630#9136264, @STran wrote:

That's fine, technically. I have some concerns about recovering from drift. Ideally, every day we pull a feed and are confident it's the latest one. If it's not, I think things break? E.g., we pull "today" but it's actually yesterday's feed, and we pull "yesterday" but it's actually from 2 days ago, so the diff ends up re-creating the current state of the db and the import fails because it tries to insert duplicate IPs.

Yeah, I think we should use Date.now() as mentioned below, and avoid using latest in the call to the vendor.

🤔 Maybe I'm overthinking it? We get Date.now() and use that to look for a feed and if it doesn't exist, wait an hour and try again? Maybe that's drift risk-free?

I don't think you're overthinking it :)

One thing that I think would help is to have a dataset table that tracks an ID of the dataset we've imported, and its status. The dataset ID can be the date we've used to download, e.g. if calling https://feeds.spur.us/v2/anonymous-residential/20230625/feed.json.gz then we store 20230625 as a row in the dataset table, and we have a couple of statuses, e.g. "Complete", "Error", "In progress". (Then it would probably also make sense to reference this row in actor_data, so we can find when a given record was first added to the actor_data table.)
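A minimal sketch of how the dataset ID and feed URL could be derived from the download date. The helper names (`toDatasetId`, `feedUrl`, `STATUS`) are hypothetical and not taken from ipoid, and using the UTC date is an assumption about how the vendor names its daily feeds.

```javascript
// Hypothetical helper: turn a Date into a YYYYMMDD dataset ID.
// Assumption: the vendor's daily feed directories follow UTC dates.
function toDatasetId(date) {
  const year = date.getUTCFullYear();
  const month = String(date.getUTCMonth() + 1).padStart(2, "0");
  const day = String(date.getUTCDate()).padStart(2, "0");
  return `${year}${month}${day}`;
}

// URL pattern copied from the example in this task; the dated segment
// replaces "latest" so we always know which feed we asked for.
function feedUrl(datasetId) {
  return `https://feeds.spur.us/v2/anonymous-residential/${datasetId}/feed.json.gz`;
}

// The three statuses proposed for the dataset table.
const STATUS = Object.freeze({
  COMPLETE: "Complete",
  ERROR: "Error",
  IN_PROGRESS: "In progress",
});

// Usage: build today's dataset row and the URL to download.
const id = toDatasetId(new Date(Date.now()));
const row = { id, status: STATUS.IN_PROGRESS };
const url = feedUrl(row.id);
```

Storing the date string rather than a timestamp keeps the ID identical to the path segment requested from the vendor.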

The import process can also use the dataset table to check for last successful import, so if we have a full import on 20230829, and 20230830 fails, then on 20230831 the script would be able to see that it needs to download and diff data from 8/29 and 8/31 instead of 8/30 and 8/31.
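The lookup described above could be sketched roughly as follows. `findDiffBase` and the `{ id, status }` row shape are hypothetical; the only assumption carried over from the thread is that IDs are fixed-width `YYYYMMDD` strings, so plain string comparison orders them by date.

```javascript
// Hypothetical helper: pick the most recent successfully imported dataset
// that precedes the one we are about to import. Returns null when there is
// no complete import yet (i.e. a full import is needed instead of a diff).
function findDiffBase(rows, targetId) {
  const completed = rows
    .filter((r) => r.status === "Complete" && r.id < targetId)
    .map((r) => r.id)
    .sort(); // lexicographic order equals date order for YYYYMMDD
  return completed.length > 0 ? completed[completed.length - 1] : null;
}

// Example from the thread: 8/29 succeeded, 8/30 failed, now importing 8/31.
const rows = [
  { id: "20230829", status: "Complete" },
  { id: "20230830", status: "Error" },
];
const base = findDiffBase(rows, "20230831"); // diff 8/29 against 8/31
```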

This will also help with hiccups in Kubernetes job creation that can allow jobs to sometimes either get duplicated or not fire at all.
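As an illustration of the duplicate-job case, a job could consult the dataset table before starting work. This is only a sketch: a `Map` stands in for the table, `tryStartBatch` and `finishBatch` are hypothetical names, and a real implementation would need the check-and-set to be atomic in the database (for example by relying on a unique key on the ID).

```javascript
// In-memory stand-in for the dataset table: id -> status.
// Hypothetical helper: claim a batch unless it is already done or running.
function tryStartBatch(table, id) {
  const status = table.get(id);
  if (status === "Complete" || status === "In progress") {
    return false; // a duplicate job should exit without importing
  }
  table.set(id, "In progress"); // new batch, or retry after "Error"
  return true;
}

// Hypothetical helper: record the outcome of a batch.
function finishBatch(table, id, ok) {
  table.set(id, ok ? "Complete" : "Error");
}

const table = new Map();
const first = tryStartBatch(table, "20230831");     // claims the batch
const duplicate = tryStartBatch(table, "20230831"); // refused while running
finishBatch(table, "20230831", false);              // import failed
const retry = tryStartBatch(table, "20230831");     // allowed after "Error"
```

A job that never fires leaves no row at all, so the next run still finds the last "Complete" entry and diffs against that.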

So, I think the table could look something like:

	ID	Status
	20230829	Complete
	20230830	Error
	20230831	In progress

Details

	Title	Reference	Author	Source Branch	Dest Branch
	Log batch status and errors	repos/mediawiki/services/ipoid!60	stran	track-batch-progress	main


Related Objects

Status	Assigned	Task
In Progress	• Niharika	T324492 Temporary accounts - MVP
Open	None	T340895 [Epic] IP Info accommodations for temporary accounts
Open	STran	T341395 Display Spur data on IPInfo infobox
Resolved	kostajh	T339284 Deploy ipoid
Resolved	kostajh	T340984 Fix problems with data import
Resolved	STran	T344749 Refactor import-db and update-db into one script
Resolved	STran	T341122 Implement daily data update routine
Resolved	STran	T345684 Implement batch tracking for imports