> In T325630#9136298, @kostajh wrote:
>> In T325630#9136264, @STran wrote:
>>> That's fine, technically. I think I have some concerns about recovering from drift. The ideal pipeline is: every day we pull a feed and we have confidence it's the latest feed. If it's not, things break, e.g. we pull "today" but it's actually yesterday's. We pull "yesterday" and it's actually from 2 days ago, so the diff ends up re-creating the state of the db, and things will fail because it'll try to insert duplicate IPs.
>> Yeah, I think we should use Date.now() as mentioned below, and avoid using latest in the call to the vendor.
🤔 Maybe I'm overthinking it? We get Date.now(), use that to look for a feed, and if it doesn't exist, wait an hour and try again? Maybe that's drift risk-free?
I don't think you're overthinking it :)
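The date-based approach discussed above could be sketched roughly as follows. This is a hypothetical illustration, not ipoid's actual implementation: the function names are invented, and the URL shape is taken from the feed URL quoted later in this thread. The idea is to derive the feed date from the current time instead of asking the vendor for "latest", and to signal a retry if that day's feed isn't published yet.

```javascript
// Build the dated feed URL from a Date, e.g. 20230625 (UTC date).
function feedUrlForDate(date) {
  const ymd = date.toISOString().slice(0, 10).replace(/-/g, '');
  return `https://feeds.spur.us/v2/anonymous-residential/${ymd}/feed.json.gz`;
}

// Hypothetical download step: never ask for "latest"; if today's feed
// is not published yet (404), tell the caller to wait and retry later.
async function fetchTodaysFeed(fetchImpl = fetch) {
  const url = feedUrlForDate(new Date(Date.now()));
  const res = await fetchImpl(url);
  if (res.status === 404) {
    // Feed not up yet: caller waits an hour and tries the same URL again.
    return { ready: false, url };
  }
  return { ready: true, url, body: res.body };
}
```

Because the URL is pinned to an explicit date, a late-published feed can never be mistaken for a newer one, which avoids the drift scenario described above.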
One thing that I think would help is to have a dataset table that tracks an ID of the dataset we've imported, and its status. The dataset ID can be the date we've used to download, e.g. if calling https://feeds.spur.us/v2/anonymous-residential/20230625/feed.json.gz then we store 20230625 as a row in the dataset table, and we have a couple of statuses, e.g. "Complete", "Error", "In progress". (Then it would probably also make sense to reference this row in actor_data, so we can find when a given record was first added to the actor_data table.)
The import process can also use the dataset table to check for last successful import, so if we have a full import on 20230829, and 20230830 fails, then on 20230831 the script would be able to see that it needs to download and diff data from 8/29 and 8/31 instead of 8/30 and 8/31.
This will also help with hiccups in Kubernetes job creation, where jobs can sometimes be duplicated or not fire at all.
So, I think the table could look something like:
| ID | Status |
|---|---|
| 20230829 | Complete |
| 20230830 | Error |
| 20230831 | In progress |
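A minimal sketch of how the importer could use the dataset-status table described above to pick its diff base. This is an in-memory illustration with invented names (ipoid would read these rows from its database): the diff base is simply the most recent dataset whose status is "Complete".

```javascript
// Statuses as described in the comment above.
const STATUS = { COMPLETE: 'Complete', ERROR: 'Error', IN_PROGRESS: 'In progress' };

// Given rows like [{ id: '20230829', status: 'Complete' }, ...],
// return the latest Complete dataset ID, or null for a fresh import.
function lastSuccessfulImport(rows) {
  return (
    rows
      .filter((r) => r.status === STATUS.COMPLETE)
      .map((r) => r.id)
      .sort() // YYYYMMDD strings sort chronologically
      .pop() || null
  );
}

// The example from the comment: 20230830 failed, so on 20230831 the
// script diffs against 20230829 instead of 20230830.
const rows = [
  { id: '20230829', status: STATUS.COMPLETE },
  { id: '20230830', status: STATUS.ERROR },
  { id: '20230831', status: STATUS.IN_PROGRESS },
];
```

With this lookup, a missed or failed day doesn't corrupt the diff: the next run always diffs from the last known-good state.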
| Title | Reference | Author | Source Branch | Dest Branch |
|---|---|---|---|---|
| Log batch status and errors | repos/mediawiki/services/ipoid!60 | stran | track-batch-progress | main |
| Status | Subtype | Assigned | Task |
|---|---|---|---|
| In Progress | | Niharika | T324492 Temporary accounts - MVP |
| Open | None | | T340895 [Epic] IP Info accommodations for temporary accounts |
| Open | | STran | T341395 Display Spur data on IPInfo infobox |
| Resolved | | kostajh | T339284 Deploy ipoid |
| Resolved | | kostajh | T340984 Fix problems with data import |
| Resolved | | STran | T344749 Refactor import-db and update-db into one script |
| Resolved | | STran | T341122 Implement daily data update routine |
| Resolved | | STran | T345684 Implement batch tracking for imports |
Event Timeline
stran opened https://gitlab.wikimedia.org/repos/mediawiki/services/ipoid/-/merge_requests/60
Log batch status and errors
tchanders merged https://gitlab.wikimedia.org/repos/mediawiki/services/ipoid/-/merge_requests/60
Log batch status and errors