Chatted with @Tchanders today. We thought it would be worth spending some engineering hours on a script that performs the expected operations, so we can confirm or deny these numbers with an eye toward making it app-ready-ish. It would 1. give us more insight into the numbers, and 2. if we move forward with our import, batched or not, the app will have to do something like this anyway. We discussed the risk of bikeshedding here: if it starts to look like the solution isn't going to be in this direction, whoever's working on this should be ready to call it off.
Description
Details
| Title | Reference | Author | Source Branch | Dest Branch |
|---|---|---|---|---|
| scripts: Add helpers for comparing datasets | repos/mediawiki/services/ipoid!19 | kharlan | T340516 | main |
| Draft: scripts: Add diff-data.js | repos/mediawiki/services/ipoid!18 | kharlan | T340516 | main |
Related Objects
Event Timeline
Investigating to get clarity on this quantity of changes sounds like a good idea. I'll see if I can make some progress on this.
kharlan opened https://gitlab.wikimedia.org/repos/mediawiki/services/ipoid/-/merge_requests/18
scripts: Add diff-data.js
kharlan closed https://gitlab.wikimedia.org/repos/mediawiki/services/ipoid/-/merge_requests/18
Draft: scripts: Add diff-data.js
kharlan opened https://gitlab.wikimedia.org/repos/mediawiki/services/ipoid/-/merge_requests/19
scripts: Add helpers for comparing datasets
The mistake in the previous comment's script was comparing randomly sampled data from current against randomly sampled data from previous, which is a nonsensical comparison. Using a random sample to speed up the comparison was interesting to explore, but in the end I went with the slow-but-accurate method of iterating over every row in both datasets.
I've published an updated MR (https://gitlab.wikimedia.org/repos/mediawiki/services/ipoid/-/merge_requests/19). Here are my findings.
The high-level scenario: assuming we have a database into which we've imported all of yesterday's data, and we then get a new file today, what volume of INSERT, UPDATE, and DELETE statements is needed to bring our database up to date with the latest information from today?
I started with a data file from June 25, with 25,798,701 rows, and one from June 26, with 25,986,913 rows. After running sort on both files, comm -12 (which prints the lines common to both) reports 15,822,417 rows. This means that, from one day to the next, at least 60% of the entries are byte-for-byte identical.
Unfortunately, Spur does not deterministically sort the fields in its JSON output, which means that you can have an entry like:
{"ip":"{omitted}","organization":"TOT Public Company Limited","as":{"number":23969,"organization":"TOT Public Company Limited"},"client":{"count":2,"types":["MOBILE","DESKTOP"],"behaviors":["TOR_PROXY_USER"],"proxies":["OXYLABS_PROXY","AWM_PROXY","LUMINATI_PROXY"],"concentration":{"geohash":"w4rwcn","city":"Bang Khen","state":"Bangkok","country":"TH","skew":20,"density":0.5},"countries":1},"tunnels":[{"type":"VPN","anonymous":false}],"services":["PPTP"],"location":{"city":"Bangkok","state":"Bangkok","country":"TH"},"risks":["CALLBACK_PROXY"]}
on June 25, and the entry for the same IP could look like this on June 26:
{"ip":"{omitted}","organization":"TOT Public Company Limited","as":{"number":23969,"organization":"TOT Public Company Limited"},"client":{"count":2,"types":["DESKTOP","MOBILE"],"behaviors":["TOR_PROXY_USER"],"proxies":["OXYLABS_PROXY","LUMINATI_PROXY","AWM_PROXY"],"concentration":{"geohash":"w4rwcn","city":"Bang Khen","state":"Bangkok","country":"TH","skew":20,"density":0.5},"countries":1},"tunnels":[{"type":"VPN","anonymous":false}],"services":["PPTP"],"location":{"city":"Bangkok","state":"Bangkok","country":"TH"},"risks":["CALLBACK_PROXY"]}
So, the metadata is the same, but in one entry, we have ["MOBILE","DESKTOP"] and in the other we have ["DESKTOP","MOBILE"].
That means comm -12 gives only a lower bound: the actual percentage of JSON blobs that match across both files could be higher. We need a comparison function that checks whether the metadata for an IP is equivalent from yesterday to today. I used a function that recursively sorts all properties within an object, then used JSON.stringify() to check for equivalence.
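A minimal sketch of that comparison (not the MR's actual code; the function and variable names here are illustrative). Note that it also sorts array elements, on the assumption that element order in arrays like client.types carries no meaning in Spur's output:

```javascript
// Canonicalize a parsed JSON value so that equivalent entries
// stringify identically: object keys are sorted, and array elements
// are sorted by their stringified form (an assumption about the data:
// ["MOBILE","DESKTOP"] and ["DESKTOP","MOBILE"] are the same entry).
function canonicalize(value) {
  if (Array.isArray(value)) {
    return value
      .map(canonicalize)
      .sort((a, b) => JSON.stringify(a).localeCompare(JSON.stringify(b)));
  }
  if (value !== null && typeof value === 'object') {
    const sorted = {};
    for (const key of Object.keys(value).sort()) {
      sorted[key] = canonicalize(value[key]);
    }
    return sorted;
  }
  return value; // primitives pass through unchanged
}

// Two entries are equivalent if their canonical forms stringify equally.
function sameMetadata(a, b) {
  return JSON.stringify(canonicalize(a)) === JSON.stringify(canonicalize(b));
}
```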
To compare the datasets, I set up a MariaDB database called diffdata. In this database, I created five tables:
- current
- previous
- report_updated
- report_unchanged
- report_deleted
- report_insert
Each table contains an identical structure:
```sql
CREATE TABLE current (
    data JSON,
    CHECK (data IS NULL OR JSON_VALID(data))
);
ALTER TABLE current
    ADD COLUMN ip VARCHAR(128) AS (JSON_VALUE(data, '$.ip')) VIRTUAL;
CREATE INDEX ip ON current (ip);
```
I used a script to iterate over the JSON lines in the source data files, loading each June 25 row into the previous table and each June 26 row into the current table. This takes about 1 hour 15 minutes per file on my laptop. I recommend running the script serially, rather than in parallel, to avoid overwhelming your MariaDB instance.
(Side note: of the multiple ways I've tried to work with these large files, MariaDB and a JSON type column has been the fastest and most pleasant to work with.)
Then I used two scripts to do the following:
- [compare-data.js] Iterate over all rows from the June 25 data file. For each row, ask the following questions:
- Does the row's IP exist in the current dataset? If it does not, June 26 no longer contains an IP seen on June 25, which means this would be a DELETE query. Add this as a row to the report_deleted table.
- The row's IP exists; now recursively sort the fields in the object and use JSON.stringify() to compare against the recursively sorted object from current. If the objects are the same, add a new entry to report_unchanged; otherwise, add a row to report_updated.
- n.b. this script took 1:25 hours to run on my laptop
- [calculate-inserts.js] Iterate over all rows from the June 26 data file. For each row:
- check if its IP address is present in the previous table. If it's not present, that means that an IP address appeared on June 26 that wasn't known on June 25. That would be an INSERT for our app, so add a row to the report_insert table.
- n.b. this script took 35 minutes to run on my laptop.
Note that RAM usage for both scripts is fairly low (under 200 MB).
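The classification that the two scripts perform can be sketched with in-memory Maps keyed by IP instead of MariaDB tables (a simplification, not the MR code; canonical() stands in for the recursive key/array sort described above):

```javascript
// Recursively sort object keys and array elements so equivalent
// entries stringify identically (assumes array order is meaningless).
function canonical(v) {
  if (Array.isArray(v)) {
    return v.map(canonical)
      .sort((a, b) => JSON.stringify(a).localeCompare(JSON.stringify(b)));
  }
  if (v && typeof v === 'object') {
    return Object.fromEntries(Object.keys(v).sort().map(k => [k, canonical(v[k])]));
  }
  return v;
}

// previous/current: Map of ip -> parsed JSON entry for each day.
// Returns counts matching the four report_* tables.
function classify(previous, current) {
  const report = { insert: 0, update: 0, delete: 0, unchanged: 0 };
  for (const [ip, oldEntry] of previous) {
    const newEntry = current.get(ip);
    if (newEntry === undefined) {
      report.delete++; // IP gone today -> DELETE
    } else if (JSON.stringify(canonical(oldEntry)) === JSON.stringify(canonical(newEntry))) {
      report.unchanged++; // same IP, same metadata
    } else {
      report.update++; // same IP, new metadata -> UPDATE
    }
  }
  for (const ip of current.keys()) {
    if (!previous.has(ip)) report.insert++; // new IP today -> INSERT
  }
  return report;
}
```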
After both scripts finished, I ran SELECT COUNT(*) FROM {table} to get some reporting statistics:
| Description | Count | Query type |
|---|---|---|
| Number of IP addresses that are newly seen on June 26 | 2,494,599 | INSERT |
| Number of IP addresses that have updated data from June 25 to June 26 | 5,899,916 | UPDATE |
| Number of IP addresses that were seen on June 25 but not on June 26 | 2,306,387 | DELETE |
| Number of IP addresses that have identical metadata on June 25 and June 26 | 17,592,398 | N/A |
So, to update the database for ipoid each day, we would be looking at a minimum of 2.5m INSERT, 5.8m UPDATE and 2.3m DELETE statements. But that assumes we only perform one DB update per row in the source data file. Based on our current schema, each row can require multiple INSERT/UPDATE statements, so with our current plans the number of INSERT/UPDATE/DELETE statements would be some multiple of the figures in the table above.
Next step is to talk this through with @Tchanders, @STran and @Ladsgroup (or someone else from DBA).
tchanders opened https://gitlab.wikimedia.org/repos/mediawiki/services/ipoid/-/merge_requests/20
Draft: Add a script for outputting diff stats between two daily data dumps
@kostajh I started playing around with your original bash scripting and made a small pipeline to compare the whole dumps.
It runs in about 20 minutes so I was able to check a couple of other days too, and found similar results to what you found:
```
$ sh diff.sh ../tmp/20230625.json ../tmp/20230626.json 20230625_20230626.diff
21:26:00 Sorting data files...
21:30:30 Diffing...
21:39:41 Processing diff file...
21:42:43 Sorting diff file...
21:44:13 Calculating diff stats...
{ changed: 7664784, removed: 2311500, inserted: 2499712 }
21:45:52
$ sh diff.sh ../tmp/20230626.json ../tmp/20230627.json 20230626_20230627.diff
22:29:24 Sorting data files...
22:33:26 Diffing...
22:44:55 Processing diff file...
22:47:51 Sorting diff file...
22:49:07 Calculating diff stats...
{ changed: 7710882, removed: 2110535, inserted: 2437433 }
22:50:44
$ sh diff.sh ../tmp/20230627.json ../tmp/20230628.json 20230627_20230628.diff
10:41:33 Sorting data files...
10:46:22 Diffing...
10:54:30 Processing diff file...
10:57:44 Sorting diff file...
10:59:25 Calculating diff stats...
{ changed: 8024716, removed: 2089604, inserted: 2146587 }
11:01:13
```
(The numbers are probably slightly different because this script doesn't check for swaps like ["MOBILE","DESKTOP"] vs ["DESKTOP","MOBILE"].)
My next thought was how much we care about the changed data - e.g. in some instances the only difference was that client.concentration.skew had changed very slightly.
But when I re-ran only looking at the changes to client.proxies, it still found c. 6.5 million changes, so it looks like most of these changes are real and useful. (I guess we could also check whether any of those were just changes in the order of the array, but @kostajh's analysis suggests most of the differences would be real.)
@Tchanders, @Ladsgroup and I met yesterday (notes) to talk through the numbers above.
To recap, between one day and the next, we have approximately:
- 2 million IP addresses to drop
- 2.5 million IP addresses (+ metadata) to insert
- 6 million IP addresses for which we need to update metadata
@Ladsgroup confirmed that this volume of INSERT/UPDATE/DELETE queries is fine, but requested that we spread out the workload over a period of 6 hours.
@Tchanders and I then discussed briefly what our workflow for updates might look like. In general, it sounds like the proposal in T305724#8917809 would work fine. The last step, Iterate over all lines in $YESTERDAY_DUMP_FILE, find lines that do not exist in $CURRENT_DAY_DUMP_FILE, is going to be slow, but that works to our advantage given the need to avoid doing a bunch of DELETE queries all at once. (We could speed up finding the rows to DELETE by using @Tchanders's script, but that uses sort, which balloons to > 30 GB of RAM when I run it locally, so I doubt it will work well in the production Kubernetes job.)
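One possible low-RAM alternative to sorting (an assumption on my part, not anything from the MRs): hold only the current day's IPs in a Set and stream yesterday's dump past it. ~26M IP strings in a Set should stay far below the > 30 GB that sorting the full JSON lines needed:

```javascript
// yesterdayEntries: iterable of raw JSON lines from $YESTERDAY_DUMP_FILE
// todayIps: Set containing every ip present in $CURRENT_DAY_DUMP_FILE
// Returns the IPs that disappeared, i.e. the DELETE candidates.
function findDeletedIps(yesterdayEntries, todayIps) {
  const toDelete = [];
  for (const line of yesterdayEntries) {
    const ip = JSON.parse(line).ip;
    if (!todayIps.has(ip)) toDelete.push(ip); // gone today -> DELETE
  }
  return toDelete;
}
```

In a real run, yesterdayEntries would come from a readline stream over the dump file, so only the Set of IPs, not the full JSON, lives in memory.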
I've implemented a more RAM-efficient sort if that helps; see the latest version of the MR.
kharlan closed https://gitlab.wikimedia.org/repos/mediawiki/services/ipoid/-/merge_requests/19
scripts: Add helpers for comparing datasets