Implement batch tracking for imports
Closed, Resolved (Public)

Description

That's fine, technically. I have some concerns about recovering from drift. In the ideal pipeline, we pull a feed every day and are confident it's the latest feed. If it's not, I think things break? E.g. we pull "today" but it's actually yesterday's feed, and we pull "yesterday" but it's actually from 2 days ago, so the diff ends up re-creating the current state of the db, and the import fails because it tries to insert duplicate IPs.

Yeah, I think we should use Date.now() as mentioned below, and avoid using "latest" in the call to the vendor.

🤔 Maybe I'm overthinking it? We get Date.now() and use that to look for a feed, and if the feed doesn't exist yet, we wait an hour and try again? Maybe that removes the drift risk?
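The date-pinned retry described above could be sketched roughly as follows. This is only an illustration in Python, not the service's code: the URL pattern is the one quoted in this task, while `dataset_id_for`, `fetch_dated_feed`, the injected `fetch` and `sleep` callables, the UTC date, and the attempt limit are all assumptions made for the example.

```python
from datetime import datetime, timezone
from typing import Callable, Optional

# URL pattern copied from the example in this task; everything else is hypothetical.
FEED_URL = "https://feeds.spur.us/v2/anonymous-residential/{dataset_id}/feed.json.gz"


def dataset_id_for(now: datetime) -> str:
    """Format a timezone-aware timestamp as a YYYYMMDD dataset ID (UTC is an assumption)."""
    return now.astimezone(timezone.utc).strftime("%Y%m%d")


def fetch_dated_feed(
    now: datetime,
    fetch: Callable[[str], Optional[bytes]],
    sleep: Callable[[float], None],
    max_attempts: int = 6,
    wait_seconds: float = 3600.0,
) -> Optional[bytes]:
    """Request the feed for an explicit date instead of asking for "latest".

    `fetch` is assumed to return None while the dated feed is not published yet.
    The date is pinned once, so every retry asks for the same dataset.
    """
    url = FEED_URL.format(dataset_id=dataset_id_for(now))
    for attempt in range(max_attempts):
        body = fetch(url)
        if body is not None:
            return body
        if attempt < max_attempts - 1:
            sleep(wait_seconds)  # wait an hour, then retry the same dated URL
    return None  # caller decides how to record the failure
```

Passing `fetch` and `sleep` in as parameters keeps the sketch testable without network access; a real job would plug in its HTTP client and `time.sleep`.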

I don't think you're overthinking it :)

One thing that I think would help is to have a dataset table that tracks the ID of each dataset we've imported, and its status. The dataset ID can be the date we've used to download, e.g. if calling https://feeds.spur.us/v2/anonymous-residential/20230625/feed.json.gz then we store 20230625 as a row in the dataset table, and we have a few statuses, e.g. "Complete", "Error", "In progress". (Then it would probably also make sense to reference this row in actor_data, so we can find when a given record was first added to the actor_data table.)
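A minimal sketch of that idea, using an in-memory SQLite database as a stand-in for the service's real database. The table names `dataset` and `actor_data` and the three statuses come from this task; the column names, the `ip` column, and the helpers `open_db` and `set_status` are hypothetical.

```python
import sqlite3

# Hypothetical schema: only the table names and status values come from the task.
SCHEMA = """
CREATE TABLE dataset (
    id     TEXT PRIMARY KEY,  -- download date, e.g. '20230625'
    status TEXT NOT NULL CHECK (status IN ('In progress', 'Complete', 'Error'))
);
CREATE TABLE actor_data (
    ip         TEXT PRIMARY KEY,                     -- placeholder column
    dataset_id TEXT NOT NULL REFERENCES dataset(id)  -- batch that first added the row
);
"""


def open_db() -> sqlite3.Connection:
    """Create an in-memory database with foreign-key enforcement enabled."""
    conn = sqlite3.connect(":memory:")
    conn.execute("PRAGMA foreign_keys = ON")
    conn.executescript(SCHEMA)
    return conn


def set_status(conn: sqlite3.Connection, dataset_id: str, status: str) -> None:
    """Update the status of an existing dataset row, or insert the row if missing."""
    cur = conn.execute(
        "UPDATE dataset SET status = ? WHERE id = ?", (status, dataset_id)
    )
    if cur.rowcount == 0:
        conn.execute(
            "INSERT INTO dataset (id, status) VALUES (?, ?)", (dataset_id, status)
        )
    conn.commit()
```

With the foreign key in place, a query joining `actor_data` to `dataset` can answer which batch first introduced a given record.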

The import process can also use the dataset table to check for the last successful import, so if we have a full import on 20230829, and 20230830 fails, then on 20230831 the script would be able to see that it needs to download and diff data from 20230829 and 20230831 instead of 20230830 and 20230831.
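Choosing the diff base this way could look like the sketch below. The function name `last_complete_before` and the representation of rows as `(id, status)` tuples are assumptions for illustration; the dates and statuses mirror the example in the paragraph above.

```python
from typing import Iterable, Optional, Tuple


def last_complete_before(
    rows: Iterable[Tuple[str, str]], target_id: str
) -> Optional[str]:
    """Return the newest 'Complete' dataset ID older than target_id, or None.

    YYYYMMDD strings sort chronologically, so plain string comparison is enough.
    """
    complete = [
        ds_id for ds_id, status in rows
        if status == "Complete" and ds_id < target_id
    ]
    return max(complete, default=None)


# Rows as they might look after a failed import on 20230830.
rows = [("20230829", "Complete"), ("20230830", "Error")]
base = last_complete_before(rows, "20230831")  # diff 20230831 against this ID
```

If no earlier complete import exists, the function returns `None`, which a caller could treat as a signal to do a full import rather than a diff.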

This will also help with hiccups in Kubernetes job creation that can allow jobs to sometimes either get duplicated or not fire at all.
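One way the dataset table could guard against a duplicated job is to let the primary key decide which job owns a batch. The sketch below is an assumption about how that might work, again using SQLite as a stand-in; `create_dataset_table` and `try_claim` are hypothetical names, and how to retry a batch whose row is already marked "Error" is deliberately left out.

```python
import sqlite3


def create_dataset_table(conn: sqlite3.Connection) -> None:
    """Create a minimal dataset table (hypothetical columns) for the example."""
    conn.execute(
        "CREATE TABLE dataset (id TEXT PRIMARY KEY, status TEXT NOT NULL)"
    )


def try_claim(conn: sqlite3.Connection, dataset_id: str) -> bool:
    """Return True if this job claimed the batch, False if a row already exists.

    The primary key makes the insert fail for any second job that tries to
    start the same dataset ID, so only one job proceeds.
    """
    try:
        with conn:  # commits on success, rolls back on error
            conn.execute(
                "INSERT INTO dataset (id, status) VALUES (?, 'In progress')",
                (dataset_id,),
            )
        return True
    except sqlite3.IntegrityError:
        return False
```

A job that never fired leaves no row at all, so the next run can detect the gap by looking for the last complete dataset instead of assuming yesterday's import succeeded.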

So, I think the table could look something like:

ID        Status
20230829  Complete
20230830  Error
20230831  In progress

Details

Title                        Reference                          Author  Source Branch         Dest Branch
Log batch status and errors  repos/mediawiki/services/ipoid!60  stran   track-batch-progress  main

Event Timeline