Implement daily data update routine
Closed, Resolved | Public

Description

In this task, we'll create the script for updating the ipoid database with $CURRENT_DAY data, assuming that we already have a full set of data in the DB.

See T305724: Investigate database data invalidation questions and chunked/timed API to MySQL/MariaDB ETL and T325635: Investigate: Alternatives or improvements to data import method for previous discussion. Copying from T305724#8917809:

  1. Initial import, get the ~25m records into the DB (all INSERT queries). Tracked in T339331: Prepare for initial data import on production servers
  2. Daily ingest of data (cron job in Kubernetes):
    • Download dump file for today as $CURRENT_DAY_DUMP_FILE
    • Iterate over all lines in $CURRENT_DAY_DUMP_FILE and do a SELECT query against the DB.
      • If we find a result:
        • the metadata matches what is in $CURRENT_DAY_DUMP_FILE: do nothing, the data's current
        • the metadata doesn't match: UPDATE the record and associated metadata records as needed
      • If we don't find a result
        • It's a new record, run an INSERT to add to the DB
    • Download dump file for yesterday as $YESTERDAY_DUMP_FILE
    • Iterate over all lines in $YESTERDAY_DUMP_FILE, find lines that do not exist in $CURRENT_DAY_DUMP_FILE
      • Issue DELETE queries for all lines that are in $YESTERDAY_DUMP_FILE and not in $CURRENT_DAY_DUMP_FILE, as those are stale data

Note that the script processing the updates should add pauses so that the entire process spans at least 6 hours. See T340516: Investigate diff expectations for updating imported data for details on the expected number of new entries, updates, and deletions; tl;dr it's about 2M new entries, 6M updates, and 2M deletions. A rough sketch of the ingest loop and its pacing follows.
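
The sketch below is an illustration only, not the actual ipoid implementation: it assumes a mysql2-style promise client, one JSON object per dump line, and placeholder table/column names (actor_data, ip, risks). The constants are likewise illustrative: ~25M rows with a 10-second pause every 10,000 rows gives roughly 7 hours of pauses, satisfying the "at least 6 hours" pacing.

```
import * as fs from 'fs';
import * as readline from 'readline';
import * as mysql from 'mysql2/promise';

const ROWS_PER_PAUSE = 10000;
const PAUSE_MS = 10000;
const sleep = ( ms: number ) => new Promise( ( resolve ) => setTimeout( resolve, ms ) );

async function ingestDump( db: mysql.Connection, dumpPath: string ): Promise<void> {
	const rl = readline.createInterface( { input: fs.createReadStream( dumpPath ) } );
	let processed = 0;
	for await ( const line of rl ) {
		if ( line.trim() === '' ) {
			continue;
		}
		const record = JSON.parse( line );
		const [ rows ] = await db.execute<mysql.RowDataPacket[]>(
			'SELECT risks FROM actor_data WHERE ip = ?', [ record.ip ]
		);
		if ( rows.length === 0 ) {
			// No result: it's a new record, so INSERT it.
			await db.execute(
				'INSERT INTO actor_data ( ip, risks ) VALUES ( ?, ? )',
				[ record.ip, JSON.stringify( record.risks ) ]
			);
		} else if ( rows[ 0 ].risks !== JSON.stringify( record.risks ) ) {
			// Metadata doesn't match the dump: UPDATE the record.
			await db.execute(
				'UPDATE actor_data SET risks = ? WHERE ip = ?',
				[ JSON.stringify( record.risks ), record.ip ]
			);
		}
		// Otherwise the data is current: do nothing.
		if ( ++processed % ROWS_PER_PAUSE === 0 ) {
			await sleep( PAUSE_MS );
		}
	}
}
```

The deletion pass would stream $YESTERDAY_DUMP_FILE in the same way, skip any entry that also appears in $CURRENT_DAY_DUMP_FILE, and issue DELETE statements for the rest.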

Details

Title | Reference | Author | Source Branch | Dest Branch
Import the behaviors, proxies and tunnels that are in the data files | repos/mediawiki/services/ipoid!68 | tchanders | get-properties | main
Remove concentration skew/density data | repos/mediawiki/services/ipoid!35 | stran | remove-conc-data | main
Add a script for outputting diff stats between two daily data dumps | repos/mediawiki/services/ipoid!20 | tchanders | diffing | main

Event Timeline

From T339331#8992755:

Given that we have the ability to provide as many resources as the pod needs, we can make it possible to load the whole dump in memory, if that would help with our current challenges.

@jijiki Does this only apply to the initial import, or could we do something similar for the daily updates - i.e. have the two dumps uncompressed in memory? I'm wondering about faster ways to compare the two dumps without reading from the database. We have a diffing script that works quickly, but uses a lot of storage space, and I'm wondering if it's at all worth pursuing.

@Tchanders we could potentially provide those resources, as long as they are for a limited amount of time, eg 2hrs.

In the case where we will be comparing two full dumps (downloaded and uncompressed during execution):

  • how much memory and storage would you need?
  • what would be the duration of this job?

Rough numbers from running a draft script locally:

  • RAM: a few GB (<5)
  • Storage: c.20GB
  • Duration: c.30 minutes

If these numbers are way out, we can probably bring some down at the expense of others (e.g. less memory, more time). I'd be interested to understand ballpark limits if possible. (And sorry if these are way, way out! I don't have much insight into what's available.)

The other approach we're looking at has much lower RAM and storage needs, but runs over several hours and uses millions of database reads.

Adding @RLazarus, since I believe @jijiki is out.

Those numbers don't immediately raise alarm bells for me -- "storage" doesn't mean anything persistent, only ephemeral data that can disappear when the script exits, right? As long as that's the case (and assuming you're using ~1 CPU), you should be fine. I'm tagging in @akosiaris to confirm the resource request is sensible.

One other piece to consider, wrt ephemeral storage: you should expect it to go away even if the script doesn't complete successfully -- maybe it crashes, or maybe the scheduler needs to move it because the host machine is shutting down for maintenance, or whatever. That won't happen very often, but you want to make sure it's a non-disaster: assuming the answer is "we'll just rerun the script and it'll refresh everything" or even "we'll leave it until the next daily automatic run" that's fine. But if it were something like "we can't re-download that dump, we only stored it ephemerally, and if anything happens it'll be gone forever" (or even "we'll have committed a partial update and the inconsistent data is a problem") then you'd want to make a different plan. I think you're likely fine here, I just wanted to raise the question explicitly.

Confirmed. Those numbers are perfectly sensible.

There is one extra thing that might complicate the above, and it makes it extra important that the job is idempotent. Kubernetes nodes, when under stress, will "evict" all their workloads. "Evict" here is Kubernetes terminology and means they will forcefully kill them, emptying themselves to avoid failure modes where they become unresponsive. The Kubernetes platform becomes aware of this and will reschedule workloads on other nodes (depending on a few things). I doubt memory and CPU usage will cause enough stress for eviction to start happening, and we are anyway safeguarding against that, but we've seen high disk usage do that in the past. All of this is just to add extra credence to Reuven's statement that you should treat the workload as ephemeral and idempotent. It might be killed at any point in time.
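
One way to keep the individual writes idempotent, so that an evicted or crashed run can simply be rerun from the top, is to collapse the SELECT-then-INSERT/UPDATE from the task description into a single upsert. A minimal sketch, again assuming a mysql2-style client and the same placeholder actor_data schema with a UNIQUE key on ip:

```
import * as mysql from 'mysql2/promise';

// Rerunning this after a crash or eviction converges on the same end state,
// so a partially applied daily update can simply be restarted.
async function upsertActor( db: mysql.Connection, ip: string, risks: string[] ): Promise<void> {
	await db.execute(
		'INSERT INTO actor_data ( ip, risks ) VALUES ( ?, ? ) ' +
		'ON DUPLICATE KEY UPDATE risks = VALUES( risks )',
		[ ip, JSON.stringify( risks ) ]
	);
}
```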

@Ladsgroup I went through the chain of tickets where we were discussing implementation, and T305114: Set up MariaDB for iPoid in particular. I thought there was an explicit reason we couldn't hotswap between two databases, but I can't find it, so apologies if this has been covered already. Would this method work?

  1. Have 2 databases, one accessible on prod and one not
  2. Drop and write everything to the offline database
  3. Swap out the prod/offline databases
  4. Repeat daily

Writing up the diffing on a row-by-row basis seems less than optimal. It would have to check against four tables (actor_data, tunnels, behaviors, and risks) for each row to see whether anything was updated. It's doable, but it feels messy; maybe SELECTs are cheap enough that this is okay?
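
For illustration, the swap in step 3 could lean on MariaDB's RENAME TABLE, which can move several tables between schemas in one statement; the exact locking and atomicity guarantees would need checking for the server version in use, and the schema names here (ipoid_prod, ipoid_staging, ipoid_tmp) are hypothetical.

```
import * as mysql from 'mysql2/promise';

const TABLES = [ 'actor_data', 'tunnels', 'behaviors', 'risks' ];

// Rotate each table: prod -> tmp, staging -> prod, tmp -> staging, so the
// freshly loaded staging schema becomes the one the service reads from.
async function swapDatabases( db: mysql.Connection ): Promise<void> {
	const clauses = TABLES.map( ( t ) =>
		`ipoid_prod.${ t } TO ipoid_tmp.${ t }, ` +
		`ipoid_staging.${ t } TO ipoid_prod.${ t }, ` +
		`ipoid_tmp.${ t } TO ipoid_staging.${ t }`
	);
	await db.query( `RENAME TABLE ${ clauses.join( ', ' ) }` );
}
```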

The approach that I was testing in T341122#9042628 was to compare the data files for consecutive days directly. There'd be a comparison pipeline, followed by a data import script. The comparison pipeline would go something like this (a sketch follows the list):

  1. Download the zipped data for yesterday and today
  2. Unzip and sort the data files
  3. Remove identical lines
  4. Compare the remaining lines and translate into mysql statements, saved as a .sql file
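
As an illustration of steps 3 and 4 (this is not the diffing script from the merge request above), the comparison boils down to a set difference over the dump lines. The sketch loads both dumps into memory, echoing the "whole dump in memory" idea discussed earlier, at the cost of a few GB of RAM; toSql() and the record shape are hypothetical.

```
import * as fs from 'fs';
import * as readline from 'readline';

// Stream a dump file into a set of its non-empty lines.
async function readLines( path: string ): Promise<Set<string>> {
	const lines = new Set<string>();
	const rl = readline.createInterface( { input: fs.createReadStream( path ) } );
	for await ( const line of rl ) {
		if ( line.trim() !== '' ) {
			lines.add( line );
		}
	}
	return lines;
}

// Hypothetical translator from a dump record to a SQL statement; the real
// mapping depends on the dump schema. No escaping, illustration only.
function toSql( kind: 'upsert' | 'delete', record: { ip: string } ): string {
	return kind === 'delete' ?
		`DELETE FROM actor_data WHERE ip = '${ record.ip }';` :
		`REPLACE INTO actor_data ( ip ) VALUES ( '${ record.ip }' );`;
}

async function diffDumps( yesterdayPath: string, todayPath: string ): Promise<string[]> {
	const yesterday = await readLines( yesterdayPath );
	const today = await readLines( todayPath );
	const statements: string[] = [];
	for ( const line of today ) {
		if ( !yesterday.has( line ) ) {
			// New or changed line: becomes an INSERT/UPDATE.
			statements.push( toSql( 'upsert', JSON.parse( line ) ) );
		}
	}
	for ( const line of yesterday ) {
		if ( !today.has( line ) ) {
			// Line vanished: stale data. (A changed line also shows up here, so
			// the real script must skip IPs that still appear in today's dump.)
			statements.push( toSql( 'delete', JSON.parse( line ) ) );
		}
	}
	return statements;
}
```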

Then for the update script, we'd slowly make the database updates by running the .sql file in batches and pausing, "so that the entire process spans at least 6 hours" (as requested by DBAs - see this task's description).
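
A minimal sketch of that update step, assuming the comparison pipeline wrote one statement per line to a file (updates.sql is a hypothetical name) and a mysql2-style client; the batch size and pause are placeholders, chosen so the roughly 10M statements expected per day (see the task description) stretch past the six-hour mark once query time is included.

```
import * as fs from 'fs';
import * as readline from 'readline';
import * as mysql from 'mysql2/promise';

const BATCH_SIZE = 500;
const PAUSE_MS = 1000; // ~10M statements / 500 per batch * 1 s ≈ 5.5 h of pauses alone.
const sleep = ( ms: number ) => new Promise( ( resolve ) => setTimeout( resolve, ms ) );

async function applyUpdates( db: mysql.Connection, sqlPath: string ): Promise<void> {
	const rl = readline.createInterface( { input: fs.createReadStream( sqlPath ) } );
	let inBatch = 0;
	for await ( const statement of rl ) {
		if ( statement.trim() === '' ) {
			continue;
		}
		await db.query( statement );
		if ( ++inBatch >= BATCH_SIZE ) {
			// Pause between batches so the full run spans at least 6 hours.
			await sleep( PAUSE_MS );
			inBatch = 0;
		}
	}
}
```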

Thanks for the helpful information @RLazarus and @akosiaris. The storage space is only needed for the duration that the comparison script runs (it creates temporary files for sorting efficiently, etc.), and if something catastrophic happens it can re-download the data files and start again, or miss a day and wait until tomorrow.

For the update script, the .sql file would be stored while the database updates happen. That's comparable to the resources that would be needed for the approach outlined in this task's description, so I didn't ask about that, assuming it had already been approved. But I can elaborate more if necessary.

tchanders updated https://gitlab.wikimedia.org/repos/mediawiki/services/ipoid/-/merge_requests/20

Add a script for outputting diff stats between two daily data dumps

@STran I've uploaded a patch that diffs two data dumps relatively quickly. There's more work to do on it as outlined in the commit message. Feel free to take this over and work on top of it, as I'm out next week!

As discussed in the retro meeting, QA are busy with other projects, and the update script is due to change as the open subtasks get completed. Moving this to blocked/stalled until they are done, so that QA can test it in one go.

Ready for QA. We've tried to document this in the README. If it's not enough, please let us know and we'll improve it!

dom_walden subscribed.

More testing of this will be done as part of T348992. I will move this to Done.