
Implement call to data vendor
Closed, Resolved | Public | 8 Estimated Story Points

Description

As part of the prototype, security-api defines a FEED_PATH in docker-compose.yml and reads the file located there when it runs init-db.js to import the data into its database. There is currently no method to pull in updated data. As part of this ticket:

  • evaluate whether the fixed FEED_PATH is best practice for (temporarily?) storing a gzipped file
  • write a script that will: 1. call a third-party provider's API, 2. save the returned gzipped file somewhere, 3. repeat this process on a schedule (systemd, afaik); see the sketch after this list
  • ensure that init-db.js can still access the updated file, wherever it is
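
A minimal sketch of what that retrieval script could look like. The dated URL shape follows the vendor example given later in this task; the Token header, FEED_TOKEN env var, and fallback path are placeholders, not confirmed vendor or config details:

  // get-feed.js (sketch): download today's gzipped feed and save it to FEED_PATH.
  // The auth header and env var names here are assumptions for illustration.
  const https = require( 'https' );
  const fs = require( 'fs' );

  const ymd = new Date().toISOString().slice( 0, 10 ).replace( /-/g, '' ); // e.g. 20230625
  const url = `https://feeds.spur.us/v2/anonymous-residential/${ ymd }/feed.json.gz`;
  const dest = process.env.FEED_PATH || './tmp/today.json.gz';

  https.get( url, { headers: { Token: process.env.FEED_TOKEN || '' } }, ( res ) => {
      if ( res.statusCode !== 200 ) {
          console.error( `Feed request failed: HTTP ${ res.statusCode }` );
          process.exit( 1 );
      }
      res.pipe( fs.createWriteStream( dest ) ).on( 'finish', () => {
          console.log( `Saved feed to ${ dest }` );
      } );
  } );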

Dependencies:

Details

Title: Retrieve today's feed from provider
Reference: repos/mediawiki/services/ipoid!54
Author: stran
Source Branch: add-feed-retrieval
Dest Branch: main

Event Timeline

Niharika set the point value for this task to 8. (Dec 20 2022, 5:26 PM)

We'll need to do some additional work to round out the pipeline, which so far looks like this:

  • Daily (TODO):
    • run node ./get-feed.js, which will download today's feed into tmp/today.json.gz
    • run ./diff.sh ./tmp/yesterday.json.gz ./tmp/today.json.gz, which will create ./tmp/statements.sql. It also creates, but then deletes, ./tmp/yesterday.json, which is the sorted JSON. We might want to consider saving this for re-use, although I think the code is easier to read if we just redo this step every day.
    • run ./import.sh, which will create tmp/sub/query*.sql batches to run node ./update-db.js $FILE_PATH on.
    • TODO: We should rename today.json.gz to yesterday.json.gz in preparation for the next day's run.

We'll presumably need to write a single orchestration script to run all of these.
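
A rough sketch of what that orchestration could look like, using the file names from the steps above. It assumes import.sh only creates the batches and the orchestrator runs update-db.js on each one; this is illustrative, not the final pipeline:

  // run-daily.js (sketch): chain the daily steps described above.
  const { execFileSync } = require( 'child_process' );
  const fs = require( 'fs' );

  // 1. Download today's feed.
  execFileSync( 'node', [ './get-feed.js' ], { stdio: 'inherit' } );

  // 2. Diff yesterday's feed against today's to produce the SQL statements.
  execFileSync( './diff.sh', [ './tmp/yesterday.json.gz', './tmp/today.json.gz' ], { stdio: 'inherit' } );

  // 3. Split the statements into batches, then apply each batch.
  execFileSync( './import.sh', [], { stdio: 'inherit' } );
  for ( const file of fs.readdirSync( './tmp/sub' ).filter( ( f ) => f.startsWith( 'query' ) ) ) {
      execFileSync( 'node', [ './update-db.js', `./tmp/sub/${ file }` ], { stdio: 'inherit' } );
  }

  // 4. Keep today's feed around as tomorrow's "yesterday".
  fs.renameSync( './tmp/today.json.gz', './tmp/yesterday.json.gz' );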

As part of the initial data import, we should seed a yesterday.json.gz file.

My understanding of the Kubernetes cron job deployment in which this script will run is that there will be no files left over after script completion; each daily run will start with a fresh context and filesystem. If that understanding is correct, then on each run we'll need to download both yesterday's and today's files.

SRE have confirmed this is the case.

That's fine, technically. I do have some concerns about recovering from drift, though. The ideal pipeline is that every day we pull a feed and have confidence it's the latest feed. If it's not, I think things break? E.g. we pull "today" but it's actually yesterday's, or we pull "yesterday" and it's actually from two days ago, so the diff ends up re-creating the existing state of the db and things will fail because it'll try to insert duplicate IPs.

🤔 Maybe I'm overthinking it? We get Date.now() and use that to look for a feed and if it doesn't exist, wait an hour and try again? Maybe that's drift risk-free?

Yeah, I think we should use Date.now() as mentioned above, and avoid using latest in the call to the vendor.

I don't think you're overthinking it :)
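
To make the Date.now() idea above concrete, the lookup could work roughly like this; the HEAD check, the one-hour wait, and the retry cap are all assumptions for illustration:

  // Sketch: look for today's feed by date and retry hourly if it isn't published yet.
  const https = require( 'https' );

  function feedUrl( date ) {
      const ymd = date.toISOString().slice( 0, 10 ).replace( /-/g, '' ); // e.g. 20230625
      return `https://feeds.spur.us/v2/anonymous-residential/${ ymd }/feed.json.gz`;
  }

  // Resolves once today's feed exists; gives up after maxAttempts hourly checks.
  function waitForTodaysFeed( maxAttempts = 6 ) {
      return new Promise( ( resolve, reject ) => {
          const attempt = ( n ) => {
              https.request( feedUrl( new Date() ), { method: 'HEAD' }, ( res ) => {
                  res.resume();
                  if ( res.statusCode === 200 ) {
                      resolve();
                  } else if ( n < maxAttempts ) {
                      setTimeout( () => attempt( n + 1 ), 60 * 60 * 1000 );
                  } else {
                      reject( new Error( `Feed still missing after ${ maxAttempts } attempts` ) );
                  }
              } ).end();
          };
          attempt( 1 );
      } );
  }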

One thing that I think would help is to have a dataset table that tracks an ID of the dataset we've imported, and its status. The dataset ID can be the date we've used to download, e.g. if calling https://feeds.spur.us/v2/anonymous-residential/20230625/feed.json.gz then we store 20230625 as a row in the dataset table, and we have a couple of statuses, e.g. "Complete", "Error", "In progress". (Then it would probably also make sense to reference this row in actor_data, so we can find when a given record was first added to the actor_data table.)

The import process can also use the dataset table to check for last successful import, so if we have a full import on 20230829, and 20230830 fails, then on 20230831 the script would be able to see that it needs to download and diff data from 8/29 and 8/31 instead of 8/30 and 8/31.
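
For example, the import script could read the base date for the diff out of the dataset table instead of assuming it is always yesterday. A sketch, using the table and column names suggested above and assuming a promise-based MySQL client for the db handle:

  // Sketch: pick the diff base from the last successful import recorded in the
  // dataset table, falling back to null if nothing has completed yet.
  async function getDiffBaseDate( db ) {
      const [ rows ] = await db.query(
          "SELECT id FROM dataset WHERE status = 'Complete' ORDER BY id DESC LIMIT 1"
      );
      return rows.length ? String( rows[ 0 ].id ) : null; // e.g. '20230829'
  }

  // So if 20230830 errored out, on 20230831 this returns '20230829' and the
  // pipeline diffs the 20230829 feed against the 20230831 feed.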

This will also help with hiccups in Kubernetes job creation, where jobs can sometimes either get duplicated or not fire at all.

So, I think the table could look something like:

  ID          Status
  20230829    Complete
  20230830    Error
  20230831    In progress
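
Purely as an illustration of that shape, the table could be created with something along these lines; the column types and ENUM values are guesses, to be adjusted to however ipoid manages its schema:

  // Hypothetical DDL for the dataset table sketched above, kept as a string so
  // the existing update scripts could run it; types and constraints are assumptions.
  const CREATE_DATASET_TABLE = `
      CREATE TABLE IF NOT EXISTS dataset (
          id INT UNSIGNED NOT NULL,  -- feed date, e.g. 20230829
          status ENUM( 'In progress', 'Complete', 'Error' ) NOT NULL,
          PRIMARY KEY ( id )
      )
  `;

  module.exports = { CREATE_DATASET_TABLE };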

Chatted with @Tchanders and we have a general consensus that this direction is where we want to head. I'm going to punt the key req to T339331: Prepare for initial data import on production servers, and document this work as part of T341122: Implement daily data update routine, which seems to better encapsulate our current problem (frail imports).