Implement daily data update routine
Closed, Resolved | Public

Description

In this task, we'll create the script for updating the ipoid database with $CURRENT_DAY data, assuming that we already have a full set of data in the DB.

See T305724: Investigate database data invalidation questions and chunked/timed API to MySQL/MariaDB ETL and T325635: Investigate: Alternatives or improvements to data import method for previous discussion. Copying from T305724#8917809:

  1. Initial import, get the ~25m records into the DB (all INSERT queries). Tracked in T339331: Prepare for initial data import on production servers
  2. Daily ingest of data (cron job in Kubernetes):
    • Download dump file for today as $CURRENT_DAY_DUMP_FILE
    • Iterate over all lines in $CURRENT_DAY_DUMP_FILE and do a SELECT query against the DB.
      • If we find a result:
        • the metadata matches what is in $CURRENT_DAY_DUMP_FILE: do nothing, the data's current
        • the metadata doesn't match: UPDATE the record and associated metadata records as needed
      • If we don't find a result
        • It's a new record, run an INSERT to add to the DB
    • Download dump file for yesterday as $YESTERDAY_DUMP_FILE
    • Iterate over all lines in $YESTERDAY_DUMP_FILE, find lines that do not exist in $CURRENT_DAY_DUMP_FILE
      • Issue DELETE queries for all lines that are in $YESTERDAY_DUMP_FILE and not in $CURRENT_DAY_DUMP_FILE, as those are stale data

Note that the script processing the updates should add pauses so that the entire process spans at least 6 hours. See T340516: Investigate diff expectations for updating imported data for details on the expected number of new entries, updates, and deletions; tl;dr it's about 2M new entries, 6M updates, and 2M deletions. A rough sketch of the ingest loop and its pacing follows.
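
The sketch below is an illustration only, not the actual ipoid implementation: it assumes a mysql2-style promise client, one JSON object per dump line, and placeholder table/column names (actor_data, ip, risks). The constants are likewise illustrative: ~25M rows with a 10-second pause every 10,000 rows gives roughly 7 hours of pauses, satisfying the "at least 6 hours" pacing.

```
import * as fs from 'fs';
import * as readline from 'readline';
import * as mysql from 'mysql2/promise';

const ROWS_PER_PAUSE = 10000;
const PAUSE_MS = 10000;
const sleep = ( ms: number ) => new Promise( ( resolve ) => setTimeout( resolve, ms ) );

async function ingestDump( db: mysql.Connection, dumpPath: string ): Promise<void> {
	const rl = readline.createInterface( { input: fs.createReadStream( dumpPath ) } );
	let processed = 0;
	for await ( const line of rl ) {
		if ( line.trim() === '' ) {
			continue;
		}
		const record = JSON.parse( line );
		const [ rows ] = await db.execute<mysql.RowDataPacket[]>(
			'SELECT risks FROM actor_data WHERE ip = ?', [ record.ip ]
		);
		if ( rows.length === 0 ) {
			// No result: it's a new record, so INSERT it.
			await db.execute(
				'INSERT INTO actor_data ( ip, risks ) VALUES ( ?, ? )',
				[ record.ip, JSON.stringify( record.risks ) ]
			);
		} else if ( rows[ 0 ].risks !== JSON.stringify( record.risks ) ) {
			// Metadata doesn't match the dump: UPDATE the record.
			await db.execute(
				'UPDATE actor_data SET risks = ? WHERE ip = ?',
				[ JSON.stringify( record.risks ), record.ip ]
			);
		}
		// Otherwise the data is current: do nothing.
		if ( ++processed % ROWS_PER_PAUSE === 0 ) {
			await sleep( PAUSE_MS );
		}
	}
}
```

The deletion pass would stream $YESTERDAY_DUMP_FILE in the same way, skip any entry that also appears in $CURRENT_DAY_DUMP_FILE, and issue DELETE statements for the rest.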

Details

Title | Reference | Author | Source Branch | Dest Branch
Import the behaviors, proxies and tunnels that are in the data files | repos/mediawiki/services/ipoid!68 | tchanders | get-properties | main
Remove concentration skew/density data | repos/mediawiki/services/ipoid!35 | stran | remove-conc-data | main
Add a script for outputting diff stats between two daily data dumps | repos/mediawiki/services/ipoid!20 | tchanders | diffing | main

Event Timeline

From T339331#8992755:

Given that we have the ability to provide as many resources as the pod needs, we can make it possible to load the whole dump in memory, if that would help with our current challenges.

@jijiki Does this only apply to the initial import, or could we do something similar for the daily updates - i.e. have the two dumps uncompressed in memory? I'm wondering about faster ways to compare the two dumps without reading from the database. We have a diffing script that works quickly, but uses a lot of storage space, and I'm wondering if it's at all worth pursuing.

@Tchanders we could potentially provide those resources, as long as they are for a limited amount of time, eg 2hrs.

In the case where we will be comparing two full dumps (downloaded and uncompressed during execution):

  • how much memory and storage would you need?
  • what would be the duration of this job?

Rough numbers from running a draft script locally:

  • RAM: a few GB (<5)
  • Storage: c.20GB
  • Duration: c.30 minutes

If these numbers are way out, we can probably bring some down at the expense of others (e.g. less memory, more time). I'd be interested to understand ballpark limits if possible. (And sorry if these are way, way out! I don't have much insight into what's available.)

The other approach we're looking at has much lower RAM and storage needs, but runs over several hours and uses millions of database reads.

Adding @RLazarus, since I believe @jijiki is out.

Those numbers don't immediately raise alarm bells for me -- "storage" doesn't mean anything persistent, only ephemeral data that can disappear when the script exits, right? As long as that's the case (and assuming you're using ~1 CPU), you should be fine. I'm tagging in @akosiaris to confirm the resource request is sensible.

One other piece to consider, wrt ephemeral storage: you should expect it to go away even if the script doesn't complete successfully -- maybe it crashes, or maybe the scheduler needs to move it because the host machine is shutting down for maintenance, or whatever. That won't happen very often, but you want to make sure it's a non-disaster: assuming the answer is "we'll just rerun the script and it'll refresh everything" or even "we'll leave it until the next daily automatic run" that's fine. But if it were something like "we can't re-download that dump, we only stored it ephemerally, and if anything happens it'll be gone forever" (or even "we'll have committed a partial update and the inconsistent data is a problem") then you'd want to make a different plan. I think you're likely fine here, I just wanted to raise the question explicitly.

Confirmed. Those numbers are perfectly sensible.

There is one extra thing that might complicate the above, and it makes it extra important that the job is idempotent. Kubernetes nodes, when under stress, will "evict" all their workloads. "Evict" here is Kubernetes terminology and means they will forcefully kill them, emptying themselves to avoid failure modes where they become unresponsive. The Kubernetes platform becomes aware of this and will reschedule workloads on other nodes (depending on a few things). I doubt memory and CPU usage will cause enough stress for eviction to start happening, and we are anyway safeguarding against that, but we've seen high disk usage do that in the past. All of this is just to add extra credence to Reuven's statement that you should treat the workload as ephemeral and idempotent. It might be killed at any point in time.
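
One way to keep the individual writes idempotent, so that an evicted or crashed run can simply be rerun from the top, is to collapse the SELECT-then-INSERT/UPDATE from the task description into a single upsert. A minimal sketch, again assuming a mysql2-style client and the same placeholder actor_data schema with a UNIQUE key on ip:

```
import * as mysql from 'mysql2/promise';

// Rerunning this after a crash or eviction converges on the same end state,
// so a partially applied daily update can simply be restarted.
async function upsertActor( db: mysql.Connection, ip: string, risks: string[] ): Promise<void> {
	await db.execute(
		'INSERT INTO actor_data ( ip, risks ) VALUES ( ?, ? ) ' +
		'ON DUPLICATE KEY UPDATE risks = VALUES( risks )',
		[ ip, JSON.stringify( risks ) ]
	);
}
```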

@Ladsgroup I went through the chain of tickets where we were discussing implementation, and T305114: Set up MariaDB for iPoid in particular. I thought there was an explicit reason we couldn't hotswap between two databases, but I can't find it, so apologies if this has been covered already. Would this method work?

  1. Have 2 databases, one accessible on prod and one not
  2. Drop and write everything to the offline database
  3. Swap out the prod/offline databases
  4. Repeat daily

Writing up the diffing on a row-by-row basis seems less than optimal. It would have to check against four tables (actor_data, tunnels, behaviors, and risks) for each row to see whether anything was updated. It's doable, but it feels messy; maybe SELECTs are cheap enough that this is okay?
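
For illustration, the swap in step 3 could lean on MariaDB's RENAME TABLE, which can move several tables between schemas in one statement; the exact locking and atomicity guarantees would need checking for the server version in use, and the schema names here (ipoid_prod, ipoid_staging, ipoid_tmp) are hypothetical.

```
import * as mysql from 'mysql2/promise';

const TABLES = [ 'actor_data', 'tunnels', 'behaviors', 'risks' ];

// Rotate each table: prod -> tmp, staging -> prod, tmp -> staging, so the
// freshly loaded staging schema becomes the one the service reads from.
async function swapDatabases( db: mysql.Connection ): Promise<void> {
	const clauses = TABLES.map( ( t ) =>
		`ipoid_prod.${ t } TO ipoid_tmp.${ t }, ` +
		`ipoid_staging.${ t } TO ipoid_prod.${ t }, ` +
		`ipoid_tmp.${ t } TO ipoid_staging.${ t }`
	);
	await db.query( `RENAME TABLE ${ clauses.join( ', ' ) }` );
}
```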

The approach that I was testing in T341122#9042628 was to compare the data files for consecutive days directly. There'd be a comparison pipeline, followed by a data import script. The comparison pipeline would go something like this (a sketch follows the list):

  1. Download the zipped data for yesterday and today
  2. Unzip and sort the data files
  3. Remove identical lines
  4. Compare the remaining lines and translate into mysql statements, saved as a .sql file
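
As an illustration of steps 3 and 4 (this is not the diffing script from the merge request above), the comparison boils down to a set difference over the dump lines. The sketch loads both dumps into memory, echoing the "whole dump in memory" idea discussed earlier, at the cost of a few GB of RAM; toSql() and the record shape are hypothetical.

```
import * as fs from 'fs';
import * as readline from 'readline';

// Stream a dump file into a set of its non-empty lines.
async function readLines( path: string ): Promise<Set<string>> {
	const lines = new Set<string>();
	const rl = readline.createInterface( { input: fs.createReadStream( path ) } );
	for await ( const line of rl ) {
		if ( line.trim() !== '' ) {
			lines.add( line );
		}
	}
	return lines;
}

// Hypothetical translator from a dump record to a SQL statement; the real
// mapping depends on the dump schema. No escaping, illustration only.
function toSql( kind: 'upsert' | 'delete', record: { ip: string } ): string {
	return kind === 'delete' ?
		`DELETE FROM actor_data WHERE ip = '${ record.ip }';` :
		`REPLACE INTO actor_data ( ip ) VALUES ( '${ record.ip }' );`;
}

async function diffDumps( yesterdayPath: string, todayPath: string ): Promise<string[]> {
	const yesterday = await readLines( yesterdayPath );
	const today = await readLines( todayPath );
	const statements: string[] = [];
	for ( const line of today ) {
		if ( !yesterday.has( line ) ) {
			// New or changed line: becomes an INSERT/UPDATE.
			statements.push( toSql( 'upsert', JSON.parse( line ) ) );
		}
	}
	for ( const line of yesterday ) {
		if ( !today.has( line ) ) {
			// Line vanished: stale data. (A changed line also shows up here, so
			// the real script must skip IPs that still appear in today's dump.)
			statements.push( toSql( 'delete', JSON.parse( line ) ) );
		}
	}
	return statements;
}
```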

Then for the update script, we'd slowly make the database updates by running the .sql file in batches and pausing, "so that the entire process spans at least 6 hours" (as requested by DBAs - see this task's description).
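
A minimal sketch of that update step, assuming the comparison pipeline wrote one statement per line to a file (updates.sql is a hypothetical name) and a mysql2-style client; the batch size and pause are placeholders, chosen so the roughly 10M statements expected per day (see the task description) stretch past the six-hour mark once query time is included.

```
import * as fs from 'fs';
import * as readline from 'readline';
import * as mysql from 'mysql2/promise';

const BATCH_SIZE = 500;
const PAUSE_MS = 1000; // ~10M statements / 500 per batch * 1 s ≈ 5.5 h of pauses alone.
const sleep = ( ms: number ) => new Promise( ( resolve ) => setTimeout( resolve, ms ) );

async function applyUpdates( db: mysql.Connection, sqlPath: string ): Promise<void> {
	const rl = readline.createInterface( { input: fs.createReadStream( sqlPath ) } );
	let inBatch = 0;
	for await ( const statement of rl ) {
		if ( statement.trim() === '' ) {
			continue;
		}
		await db.query( statement );
		if ( ++inBatch >= BATCH_SIZE ) {
			// Pause between batches so the full run spans at least 6 hours.
			await sleep( PAUSE_MS );
			inBatch = 0;
		}
	}
}
```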

Thanks for the helpful information @RLazarus and @akosiaris. The storage space is only needed for the duration that the comparison script runs (it creates temporary files for sorting efficiently, etc.), and if something catastrophic happens it can re-download the data files and start again, or miss a day and wait until tomorrow.

For the update script, the .sql file would be stored while the database updates happen. That's comparable to the resources that would be needed for the approach outlined in this task's description, so I didn't ask about that, assuming it had already been approved. But I can elaborate more if necessary.

tchanders updated https://gitlab.wikimedia.org/repos/mediawiki/services/ipoid/-/merge_requests/20

Add a script for outputting diff stats between two daily data dumps

@STran I've uploaded a patch that diffs two data dumps relatively quickly. There's more work to do on it as outlined in the commit message. Feel free to take this over and work on top of it, as I'm out next week!

As discussed in the retro meeting, QA are busy with other projects, and the update script is due to change as the open subtasks get completed. Moving this to blocked/stalled until they are done, so that QA can test it in one go.

Ready for QA. We've tried to document this in the README. If it's not enough, please let us know and we'll improve it!

dom_walden subscribed.

More testing of this will be done as part of T348992. I will move this to Done.