
Investigate diff expectations for updating imported data
Closed, ResolvedPublic

Description

If we can believe comm, then comm -12 <(sort $YESTERDAY) <(sort $TODAY) (which shows the lines common to two sorted files) reports ~15M common rows. That means something like ~10M rows would need to be inserted, updated, or deleted. This seems far off the mark of the "10% churn" that someone from Spur cited, so maybe comm is wrong, or maybe "10% churn" refers only to inserts/deletes (e.g. 2.5M), in which case we'd have ~7.5M records to update.

Could we do something like:

  1. Initial import: get the ~22M records into the DB (all INSERT queries)
  2. Daily ingest of data (cron job in Kubernetes):
    • Download dump file for today as $CURRENT_DAY_DUMP_FILE
    • Iterate over all lines in $CURRENT_DAY_DUMP_FILE and do a SELECT query against the DB.
      • If we find a result:
        • the metadata matches what is in $CURRENT_DAY_DUMP_FILE: do nothing, the data's current
        • the metadata doesn't match: UPDATE the record
      • If we don't find a result
        • It's a new record, run an INSERT to add to the DB
    • Download dump file for yesterday as $YESTERDAY_DUMP_FILE
    • Iterate over all lines in $YESTERDAY_DUMP_FILE, find lines that do not exist in $CURRENT_DAY_DUMP_FILE
      • Issue DELETE queries for all lines that are in $YESTERDAY_DUMP_FILE and not in $CURRENT_DAY_DUMP_FILE as those are stale data
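The classification logic in the steps above can be sketched as follows. This is a rough illustration, not the app's actual code: the Maps stand in for the database and the dump files, and all names here are hypothetical.

```javascript
// Sketch of the per-row classification the daily ingest would perform.
// `yesterday` and `today` are Maps of IP -> serialized metadata, standing
// in for the DB contents and $CURRENT_DAY_DUMP_FILE respectively.
function classify(yesterday, today) {
  const ops = { insert: [], update: [], unchanged: [], remove: [] };
  for (const [ip, metadata] of today) {
    if (!yesterday.has(ip)) {
      ops.insert.push(ip); // new IP: INSERT
    } else if (yesterday.get(ip) !== metadata) {
      ops.update.push(ip); // metadata changed: UPDATE
    } else {
      ops.unchanged.push(ip); // identical: no-op
    }
  }
  for (const ip of yesterday.keys()) {
    if (!today.has(ip)) {
      ops.remove.push(ip); // seen yesterday but not today: DELETE
    }
  }
  return ops;
}
```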

Chatted with @Tchanders today and we thought it would be useful to confirm numbers by spending some engineering hours writing this script to perform the expected operations and confirm or deny these numbers with an eye toward making it app-ready-ish. It would 1. give us more insight into the numbers and 2. if we moved forward with our import, batched or not, the app would have to do something like this anyway. We discussed that there was a concern of bikeshedding on this and if it does start to look like the solution isn't going to be in this direction, whoever's working on this should be ready to call it off.

Details

Related Changes in GitLab:
Title | Reference | Author | Source Branch | Dest Branch
scripts: Add helpers for comparing datasets | repos/mediawiki/services/ipoid!19 | kharlan | T340516 | main
Draft: scripts: Add diff-data.js | repos/mediawiki/services/ipoid!18 | kharlan | T340516 | main

Event Timeline

Investigating to get clarity on this quantity of changes sounds like a good idea. I'll see if I can make some progress on this.

EDIT: I made a mistake in the script, reworking.

The mistake in the previous comment / script was in comparing randomly sampled data from current and previous, which was a nonsensical comparison. The idea to use a random sample to speed up the comparison work was interesting to explore, but in the end, I went for the slow-but-accurate method of iterating over every row in both datasets.

I've published an updated MR (https://gitlab.wikimedia.org/repos/mediawiki/services/ipoid/-/merge_requests/19). Here are my findings.

The high level scenario: assuming we have a database where we've imported all the data from yesterday, then we get a new file today, what volume of INSERT, UPDATE and DELETE statements are needed to have our database up-to-date with the latest information from today?

I started with a data file from June 25 (25,798,701 rows) and one from June 26 (25,986,913 rows). I sorted both files, and comm -12 (lines common to both files) reports 15,822,417 rows. This means that, from one day to the next, at least 60% of the entries are identical.
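For reference, the 60% figure follows directly from those counts:

```javascript
// Checking the "at least 60% unchanged" claim against the comm -12 output.
const june25Rows = 25_798_701;
const june26Rows = 25_986_913;
const commonRows = 15_822_417;

// Fraction of each day's file that is byte-identical across both days.
const ofJune26 = commonRows / june26Rows; // ~0.609
const ofJune25 = commonRows / june25Rows; // ~0.613
```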

Unfortunately, Spur does not deterministically sort the fields in its JSON output, which means that you can have an entry like:

{"ip":"{omitted}","organization":"TOT Public Company Limited","as":{"number":23969,"organization":"TOT Public Company Limited"},"client":{"count":2,"types":["MOBILE","DESKTOP"],"behaviors":["TOR_PROXY_USER"],"proxies":["OXYLABS_PROXY","AWM_PROXY","LUMINATI_PROXY"],"concentration":{"geohash":"w4rwcn","city":"Bang Khen","state":"Bangkok","country":"TH","skew":20,"density":0.5},"countries":1},"tunnels":[{"type":"VPN","anonymous":false}],"services":["PPTP"],"location":{"city":"Bangkok","state":"Bangkok","country":"TH"},"risks":["CALLBACK_PROXY"]}

on June 25, and the entry for the same IP could look like this on June 26:

{"ip":"{omitted}","organization":"TOT Public Company Limited","as":{"number":23969,"organization":"TOT Public Company Limited"},"client":{"count":2,"types":["DESKTOP","MOBILE"],"behaviors":["TOR_PROXY_USER"],"proxies":["OXYLABS_PROXY","LUMINATI_PROXY","AWM_PROXY"],"concentration":{"geohash":"w4rwcn","city":"Bang Khen","state":"Bangkok","country":"TH","skew":20,"density":0.5},"countries":1},"tunnels":[{"type":"VPN","anonymous":false}],"services":["PPTP"],"location":{"city":"Bangkok","state":"Bangkok","country":"TH"},"risks":["CALLBACK_PROXY"]}

So, the metadata is the same, but in one entry, we have ["MOBILE","DESKTOP"] and in the other we have ["DESKTOP","MOBILE"].

That means the comm -12 figure is a lower bound: the actual percentage of JSON blobs that match across both files could be higher. We need a comparison function to check whether the metadata for an IP is equivalent from yesterday to today. I used a function to recursively sort all properties within an object, then used JSON.stringify() to check for equivalence.
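A minimal sketch of such an order-insensitive comparison is below. The helper names are hypothetical (the actual function in the MR may differ), and sorting arrays of primitives is an assumption made here to handle reorderings like the types example above.

```javascript
// Canonicalize a parsed JSON value: sort object keys recursively and sort
// arrays of primitives, so key order and primitive-array order don't matter.
function canonicalize(value) {
  if (Array.isArray(value)) {
    const items = value.map(canonicalize);
    // Sort arrays of primitives so ["MOBILE","DESKTOP"] matches ["DESKTOP","MOBILE"].
    if (items.every((i) => typeof i !== 'object' || i === null)) {
      return items.sort();
    }
    return items;
  }
  if (value !== null && typeof value === 'object') {
    const out = {};
    for (const key of Object.keys(value).sort()) {
      out[key] = canonicalize(value[key]);
    }
    return out;
  }
  return value;
}

// Two metadata objects are equivalent if their canonical forms stringify equally.
function sameMetadata(a, b) {
  return JSON.stringify(canonicalize(a)) === JSON.stringify(canonicalize(b));
}
```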

To compare the datasets, I set up a MariaDB database called diffdata. In this database, I created six tables:

  • current
  • previous
  • report_updated
  • report_unchanged
  • report_deleted
  • report_insert

Each table contains an identical structure:

CREATE TABLE current( data JSON, CHECK (data IS NULL OR JSON_VALID(data)) );
ALTER TABLE current ADD COLUMN ip VARCHAR(128) AS (JSON_VALUE(data, '$.ip')) VIRTUAL;
CREATE INDEX ip ON current(ip);

I used a script to iterate over the JSON lines in the source data files, loading each June 25 row into the previous table and each June 26 row into the current table. This takes about 1 hour 15 minutes per file on my laptop. I recommend running the script serially, rather than in parallel, to avoid overwhelming your MariaDB instance.

(Side note: of the multiple ways I've tried to work with these large files, MariaDB and a JSON type column has been the fastest and most pleasant to work with.)

Then I used two scripts to do the following:

  1. [compare-data.js] Iterate over all rows from the June 25 data file. For each row, ask the following questions:
    • Does the row's IP exist in the current dataset? If it does not, June 26 no longer contains an IP that was present on June 25, so this would be a DELETE query. Add the row to the report_deleted table.
    • If the IP exists, recursively sort the fields in the object and use JSON.stringify() to compare against the recursively sorted object from current. If the objects are the same, add a row to report_unchanged; otherwise, add a row to report_updated.
    • n.b. this script took about 1 hour 25 minutes to run on my laptop
  2. [calculate-inserts.js] Iterate over all rows from the June 26 data file. For each row:
    • check if its IP address is present in the previous table. If it's not present, that means that an IP address appeared on June 26 that wasn't known on June 25. That would be an INSERT for our app, so add a row to the report_insert table.
    • n.b. this script took 35 minutes to run on my laptop.

Note that RAM usage for both scripts is pretty low (under 200 MB).

After both scripts finished, I ran SELECT COUNT(*) FROM {table} on each report table to get some summary statistics:

Description | Count | Query type
Number of IP addresses that are newly seen on June 26 | 2,494,599 | INSERT
Number of IP addresses that have updated data from June 25 to June 26 | 5,899,916 | UPDATE
Number of IP addresses that were seen on June 25 but not on June 26 | 2,306,387 | DELETE
Number of IP addresses that have identical metadata on June 25 and June 26 | 17,592,398 | N/A

So, to update the database for ipoid each day, we would be looking at a minimum of 2.5M INSERT, 5.9M UPDATE and 2.3M DELETE statements. But that assumes we only perform one DB update per row in the source data file. Based on our current schema, each row can require multiple INSERT/UPDATE statements, so with our current plans the number of INSERT/UPDATE/DELETE statements would be some multiple of the figures in the table above.
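As a sanity check, the four counts above are internally consistent with the source file sizes: inserts + updates + unchanged should equal the June 26 row count, and deletes + updates + unchanged should equal the June 25 row count.

```javascript
// Cross-checking the report-table counts against the source file row counts.
const inserted = 2_494_599;  // report_insert
const updated = 5_899_916;   // report_updated
const deleted = 2_306_387;   // report_deleted
const unchanged = 17_592_398; // report_unchanged

const june26Total = inserted + updated + unchanged; // rows in the June 26 file
const june25Total = deleted + updated + unchanged;  // rows in the June 25 file
```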

Next step is to talk this through with @Tchanders, @STran and @Ladsgroup (or someone else from DBA).

tchanders opened https://gitlab.wikimedia.org/repos/mediawiki/services/ipoid/-/merge_requests/20

Draft: Add a script for outputting diff stats between two daily data dumps

@kostajh I started playing around with your original bash scripting and made a small pipeline to compare the whole dumps.

It runs in about 20 minutes so I was able to check a couple of other days too, and found similar results to what you found:

$ sh diff.sh ../tmp/20230625.json ../tmp/20230626.json 20230625_20230626.diff
21:26:00
Sorting data files...
21:30:30
Diffing...
21:39:41
Processing diff file...
21:42:43
Sorting diff file...
21:44:13
Calculating diff stats...
{ changed: 7664784, removed: 2311500, inserted: 2499712 }
21:45:52

$ sh diff.sh ../tmp/20230626.json ../tmp/20230627.json 20230626_20230627.diff
22:29:24
Sorting data files...
22:33:26
Diffing...
22:44:55
Processing diff file...
22:47:51
Sorting diff file...
22:49:07
Calculating diff stats...
{ changed: 7710882, removed: 2110535, inserted: 2437433 }
22:50:44

$ sh diff.sh ../tmp/20230627.json ../tmp/20230628.json 20230627_20230628.diff
10:41:33
Sorting data files...
10:46:22
Diffing...
10:54:30
Processing diff file...
10:57:44
Sorting diff file...
10:59:25
Calculating diff stats...
{ changed: 8024716, removed: 2089604, inserted: 2146587 }
11:01:13

(The numbers are probably slightly different because this script doesn't check for swaps like ["MOBILE","DESKTOP"] vs ["DESKTOP","MOBILE"].)

My next thought was how much we care about the changed data - e.g. in some instances the only difference was that client.concentration.skew had changed very slightly.

But when I re-ran only looking at the changes to client.proxies, it still found c. 6.5 million changes, so it looks like most of these changes are real and useful. (I guess we could also check whether any of those were just changes in the order of the array, but @kostajh's analysis suggests most of the differences would be real.)

@Tchanders, @Ladsgroup and I met yesterday (notes) to talk through the numbers above.

To recap, between one day and the next, we have approximately:

  • 2 million IP addresses to drop
  • 2.5 million IP addresses (+ metadata) to insert
  • 6 million IP addresses for which we need to update metadata

@Ladsgroup confirmed that this volume of INSERT/UPDATE/DELETE queries is fine, but requested that we spread out the workload over a period of 6 hours.
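Spread over six hours, that workload is modest. As a back-of-the-envelope calculation (assuming one statement per row, which our schema may multiply):

```javascript
// Rough rate needed to spread one day's writes over a six-hour window.
const statements = 2_494_599 + 5_899_916 + 2_306_387; // INSERT + UPDATE + DELETE
const seconds = 6 * 60 * 60;
const perSecond = Math.ceil(statements / seconds); // ~500 statements/second
```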

@Tchanders and I then discussed briefly what our workflow for updates might look like. In general, it sounds like the proposal in T305724#8917809 would work fine. The last step (iterate over all lines in $YESTERDAY_DUMP_FILE and find lines that do not exist in $CURRENT_DAY_DUMP_FILE) is going to be slow, but that would actually work to our advantage, given the need to avoid issuing a burst of DELETE queries all at once. (We could speed up finding the rows to DELETE by using @Tchanders's script, but that uses sort, which balloons to over 30 GB of RAM when I run it locally, so I doubt it will work well in the production Kubernetes job.)


I've implemented a more RAM-efficient sort, if that helps; see the latest version of the MR.