
Implement call to data vendor
Closed, Resolved | Public | 8 Estimated Story Points

Description

As part of the prototype, security-api defines a FEED_PATH in docker-compose.yml and reads the file located there when it runs init-db.js to import the data into its database. There is currently no method to pull in updated data. As part of this ticket:

  • evaluate whether the fixed FEED_PATH is best practice for (temporarily?) storing a gzipped file
  • write a script that will: 1. call a third-party provider's API, 2. save the returned gzipped file somewhere, 3. repeat this process on a schedule (systemd, afaik); see the sketch after this list
  • ensure that init-db.js can still access the updated file, wherever it is
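
A minimal sketch of what that retrieval script could look like. The dated URL shape follows the vendor example given later in this task; the Token header, FEED_TOKEN env var, and fallback path are placeholders, not confirmed vendor or config details:

  // get-feed.js (sketch): download today's gzipped feed and save it to FEED_PATH.
  // The auth header and env var names here are assumptions for illustration.
  const https = require( 'https' );
  const fs = require( 'fs' );

  const ymd = new Date().toISOString().slice( 0, 10 ).replace( /-/g, '' ); // e.g. 20230625
  const url = `https://feeds.spur.us/v2/anonymous-residential/${ ymd }/feed.json.gz`;
  const dest = process.env.FEED_PATH || './tmp/today.json.gz';

  https.get( url, { headers: { Token: process.env.FEED_TOKEN || '' } }, ( res ) => {
      if ( res.statusCode !== 200 ) {
          console.error( `Feed request failed: HTTP ${ res.statusCode }` );
          process.exit( 1 );
      }
      res.pipe( fs.createWriteStream( dest ) ).on( 'finish', () => {
          console.log( `Saved feed to ${ dest }` );
      } );
  } );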

Dependencies:

Details

Title: Retrieve today's feed from provider
Reference: repos/mediawiki/services/ipoid!54
Author: stran
Source Branch: add-feed-retrieval
Dest Branch: main

Event Timeline

Niharika set the point value for this task to 8. (Dec 20 2022, 5:26 PM)

We'll need to do some additional work to round out the pipeline, which so far looks like this:

  • Daily (TODO):
    • run node ./get-feed.js, which will download today's feed into tmp/today.json.gz
    • run ./diff.sh ./tmp/yesterday.json.gz ./tmp/today.json.gz, which will create ./tmp/statements.sql. It also creates, but then deletes, ./tmp/yesterday.json, which is the sorted JSON. We might want to consider saving this for re-use, although I think the code is easier to read if we just redo this step every day.
    • run ./import.sh, which will create tmp/sub/query*.sql batches to run node ./update-db.js $FILE_PATH on.
    • TODO: We should rename today.json.gz to yesterday.json.gz in preparation for the next day's run.

We'll presumably need to write a single orchestration script to run all of these.
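
A rough sketch of what that orchestration could look like, using the file names from the steps above. It assumes import.sh only creates the batches and the orchestrator runs update-db.js on each one; this is illustrative, not the final pipeline:

  // run-daily.js (sketch): chain the daily steps described above.
  const { execFileSync } = require( 'child_process' );
  const fs = require( 'fs' );

  // 1. Download today's feed.
  execFileSync( 'node', [ './get-feed.js' ], { stdio: 'inherit' } );

  // 2. Diff yesterday's feed against today's to produce the SQL statements.
  execFileSync( './diff.sh', [ './tmp/yesterday.json.gz', './tmp/today.json.gz' ], { stdio: 'inherit' } );

  // 3. Split the statements into batches, then apply each batch.
  execFileSync( './import.sh', [], { stdio: 'inherit' } );
  for ( const file of fs.readdirSync( './tmp/sub' ).filter( ( f ) => f.startsWith( 'query' ) ) ) {
      execFileSync( 'node', [ './update-db.js', `./tmp/sub/${ file }` ], { stdio: 'inherit' } );
  }

  // 4. Keep today's feed around as tomorrow's "yesterday".
  fs.renameSync( './tmp/today.json.gz', './tmp/yesterday.json.gz' );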

As part of the initial data import, we should seed a yesterday.json.gz file.

My understanding of the Kubernetes cron job deployment in which this script will run is that there will be no files left over after script completion; each daily run will start with a fresh context and filesystem. If that understanding is correct, then on each run we'll need to download both yesterday's and today's files.

SRE have confirmed this is the case.

That's fine, technically. I do have some concerns about recovering from drift, though. The ideal pipeline is that every day we pull a feed and have confidence it's the latest feed. If it's not, I think things break? E.g. we pull "today" but it's actually yesterday's, or we pull "yesterday" and it's actually from two days ago, so the diff ends up re-creating the existing state of the db and things will fail because it'll try to insert duplicate IPs.

🤔 Maybe I'm overthinking it? We get Date.now() and use that to look for a feed and if it doesn't exist, wait an hour and try again? Maybe that's drift risk-free?

Yeah, I think we should use Date.now() as mentioned above, and avoid using latest in the call to the vendor.

I don't think you're overthinking it :)
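
To make the Date.now() idea above concrete, the lookup could work roughly like this; the HEAD check, the one-hour wait, and the retry cap are all assumptions for illustration:

  // Sketch: look for today's feed by date and retry hourly if it isn't published yet.
  const https = require( 'https' );

  function feedUrl( date ) {
      const ymd = date.toISOString().slice( 0, 10 ).replace( /-/g, '' ); // e.g. 20230625
      return `https://feeds.spur.us/v2/anonymous-residential/${ ymd }/feed.json.gz`;
  }

  // Resolves once today's feed exists; gives up after maxAttempts hourly checks.
  function waitForTodaysFeed( maxAttempts = 6 ) {
      return new Promise( ( resolve, reject ) => {
          const attempt = ( n ) => {
              https.request( feedUrl( new Date() ), { method: 'HEAD' }, ( res ) => {
                  res.resume();
                  if ( res.statusCode === 200 ) {
                      resolve();
                  } else if ( n < maxAttempts ) {
                      setTimeout( () => attempt( n + 1 ), 60 * 60 * 1000 );
                  } else {
                      reject( new Error( `Feed still missing after ${ maxAttempts } attempts` ) );
                  }
              } ).end();
          };
          attempt( 1 );
      } );
  }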

One thing that I think would help is to have a dataset table that tracks an ID of the dataset we've imported, and its status. The dataset ID can be the date we've used to download, e.g. if calling https://feeds.spur.us/v2/anonymous-residential/20230625/feed.json.gz then we store 20230625 as a row in the dataset table, and we have a couple of statuses, e.g. "Complete", "Error", "In progress". (Then it would probably also make sense to reference this row in actor_data, so we can find when a given record was first added to the actor_data table.)

The import process can also use the dataset table to check for last successful import, so if we have a full import on 20230829, and 20230830 fails, then on 20230831 the script would be able to see that it needs to download and diff data from 8/29 and 8/31 instead of 8/30 and 8/31.
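
For example, the import script could read the base date for the diff out of the dataset table instead of assuming it is always yesterday. A sketch, using the table and column names suggested above and assuming a promise-based MySQL client for the db handle:

  // Sketch: pick the diff base from the last successful import recorded in the
  // dataset table, falling back to null if nothing has completed yet.
  async function getDiffBaseDate( db ) {
      const [ rows ] = await db.query(
          "SELECT id FROM dataset WHERE status = 'Complete' ORDER BY id DESC LIMIT 1"
      );
      return rows.length ? String( rows[ 0 ].id ) : null; // e.g. '20230829'
  }

  // So if 20230830 errored out, on 20230831 this returns '20230829' and the
  // pipeline diffs the 20230829 feed against the 20230831 feed.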

This will also help with hiccups in Kubernetes job creation, where jobs can sometimes either get duplicated or not fire at all.

So, I think the table could look something like:

  ID          Status
  20230829    Complete
  20230830    Error
  20230831    In progress
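
Purely as an illustration of that shape, the table could be created with something along these lines; the column types and ENUM values are guesses, to be adjusted to however ipoid manages its schema:

  // Hypothetical DDL for the dataset table sketched above, kept as a string so
  // the existing update scripts could run it; types and constraints are assumptions.
  const CREATE_DATASET_TABLE = `
      CREATE TABLE IF NOT EXISTS dataset (
          id INT UNSIGNED NOT NULL,  -- feed date, e.g. 20230829
          status ENUM( 'In progress', 'Complete', 'Error' ) NOT NULL,
          PRIMARY KEY ( id )
      )
  `;

  module.exports = { CREATE_DATASET_TABLE };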

Chatted with @Tchanders and we have a general consensus that this direction is where we want to head. I'm going to punt the key req to T339331: Prepare for initial data import on production servers, and document this work as part of T341122: Implement daily data update routine, which seems to better encapsulate our current problem (frail imports).