Investigate database data invalidation questions and chunked/timed API to MySQL/MariaDB ETL
Closed, Resolved · Public

Description

Depending upon which vendor product is chosen (particularly a "daily" feed option), there will be a need to implement the following functionality for the API to MySQL/MariaDB ETL:

  1. A mechanism to chunk a data insert/update over a period of time (see: T305114#7824528 and related comments)
  2. A mechanism to properly invalidate historical data after a given time period (e.g. data over X time units old should be flushed from the db) T325642: Implement data expiry
  3. Define and implement any fallback mechanisms for when the API or database becomes unavailable.

Event Timeline

  1. A mechanism to chunk a data insert/update over a period of time (see: T305114#7824528 and related comments)

To expand on this, after T325642: Implement data expiry we'll be:

  1. expecting to update using the whole data dump
  2. if an IP is not yet in the database, inserting the row
  3. if an IP is already in the database, deleting the existing row and inserting a new row (even if the data are the same)
  4. deleting any row that wasn't replaced (expiring)

According to @Ladsgroup in T305114#7824528 this will be too much in one go:

Yup, a massive change like that would choke the whole cluster. The update must be gradual, e.g. in course of an hour. Probably you can also do differential update? Update only what has changed since the last update?

Possibilities:

  • Can we work from a diff file at all? I believe not, since it appears the diff feed doesn't contain all the data we need
  • Instead of doing a REPLACE in (3), we could SELECT the existing row and compare first, and do nothing if the data for that IP are unchanged. Presumably a load of reads from replicas wouldn't be a problem. But would the number of rows left to insert still be too much?
  • Find some way of slowing this process down
  • Instead of doing a REPLACE in (3), we could SELECT the existing row and compare first, and do nothing if the data for that IP are unchanged. Presumably a load of reads from replicas wouldn't be a problem. But would the number of rows left to insert still be too much?

Right, if we use an UPDATE instead of DELETE/INSERT or REPLACE INTO, does that make this entire set of transactions easier for the MariaDB server to handle?

we could SELECT the existing row and compare first, and do nothing if the data for that IP are unchanged

I believe we always need to update the last_updated column, so that we can keep track of which items need expiry.

  • Find some way of slowing this process down

We could consider multiple runs per day, where each run processes rows related to a specific attribute in the data dump that we think more or less evenly divides the records to process, e.g. using different known values of the infrastructure attribute or the tag attribute. So run one processes records where infrastructure = DATACENTER, another run processes records where infrastructure = {something else}, etc.

It'd be nice to avoid this additional complexity, though.
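A minimal sketch of that attribute-based partitioning, assuming the dump is newline-delimited JSON with a top-level infrastructure field (both assumptions; the sample records below are fabricated):

```python
import json
from collections import defaultdict

def partition_by_attribute(lines, attribute="infrastructure", default="UNKNOWN"):
    """Group dump records by one attribute so each group can be processed
    in its own run. Records missing the attribute fall into a catch-all bucket."""
    runs = defaultdict(list)
    for line in lines:
        record = json.loads(line)
        runs[record.get(attribute, default)].append(record)
    return runs

# Tiny fabricated sample standing in for the real dump:
dump = [
    '{"ip": "192.0.2.1", "infrastructure": "DATACENTER"}',
    '{"ip": "192.0.2.2", "infrastructure": "MOBILE"}',
    '{"ip": "192.0.2.3", "infrastructure": "DATACENTER"}',
]
runs = partition_by_attribute(dump)
```

Whether the chosen attribute actually divides the records evenly would need to be checked against a real dump.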

I was chatting with @Ladsgroup about the limits of what we could do with INSERT/DELETE/UPDATE. It sounds like running 20,000,000 UPDATE queries isn't going to work well.

Could we do something like:

  1. Initial import, get the ~22m records into the DB (all INSERT queries)
  2. Daily ingest of data (cron job in Kubernetes):
    • Download dump file for today as $CURRENT_DAY_DUMP_FILE
    • Iterate over all lines in $CURRENT_DAY_DUMP_FILE and do a SELECT query against the DB.
      • If we find a result:
        • the metadata matches what is in $CURRENT_DAY_DUMP_FILE: do nothing, the data's current
        • the metadata doesn't match: UPDATE the record
      • If we don't find a result:
        • It's a new record, run an INSERT to add to the DB
    • Download dump file for yesterday as $YESTERDAY_DUMP_FILE
    • Iterate over all lines in $YESTERDAY_DUMP_FILE, find lines that do not exist in $CURRENT_DAY_DUMP_FILE
      • Issue DELETE queries for all lines that are in $YESTERDAY_DUMP_FILE and not in $CURRENT_DAY_DUMP_FILE as those are stale data

If the ~10% churn in daily data that Spur cited is accurate, then we'd be looking at ~2 million INSERT/UPDATE/DELETE statements per run.
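Keyed by IP, the comparison in the steps above boils down to three set operations. A sketch, assuming newline-delimited JSON records keyed by an ip field (field names here are illustrative, not Spur's actual schema):

```python
import json

def plan_changes(yesterday_lines, today_lines):
    """Given two newline-delimited JSON dumps keyed by "ip", work out
    which rows to INSERT, UPDATE, and DELETE for the daily ingest."""
    old = {r["ip"]: r for r in map(json.loads, yesterday_lines)}
    new = {r["ip"]: r for r in map(json.loads, today_lines)}
    inserts = [new[ip] for ip in new.keys() - old.keys()]   # new IPs
    deletes = sorted(old.keys() - new.keys())               # stale IPs
    # Only rows whose metadata actually changed need an UPDATE.
    updates = [new[ip] for ip in new.keys() & old.keys() if new[ip] != old[ip]]
    return inserts, updates, deletes

yesterday = ['{"ip": "192.0.2.1", "tag": "OPEN_PROXY"}',
             '{"ip": "192.0.2.2", "tag": "VPN"}']
today = ['{"ip": "192.0.2.2", "tag": "OPEN_PROXY"}',
         '{"ip": "192.0.2.3", "tag": "VPN"}']
inserts, updates, deletes = plan_changes(yesterday, today)
```

In practice the two ~26M-line dumps won't both fit comfortably in a dict; the same logic could be streamed over sorted files, comm-style.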

Yes. On top of that you can do this in batches of 10,000 rows, find the diff, insert and update needed rows, sleep for 2 seconds, move on to the next batch.

To my understanding (please correct me if I am wrong), we have some tolerance for stale data. If that is true, and given that the diff offered by the provider is missing some groups of data, my suggestion is the following:

Assumptions:

  • We have imported all ~22M records
  • We can tolerate some amount of stale data (and have discussed what this means)
  • We keep track of when we last fetched the diff, and whether it was successful
  • The diff is of a manageable size

Cronjobs:

  • Have a daily cronjob (6 days a week) fetching the diff, and UPDATE/DELETE/INSERT affected rows
  • Have a weekly cronjob (during a window where the DB servers are less busy) to do a full DB update as described in T305724#8917809

After the API is in production, we can analyse how frequently users requested data from the "stale group", so as to evaluate how this impacts our business needs.

Why do this? We would save compute resources (and energy) on both our Kubernetes and MySQL clusters, avoid long-running jobs, and not have to worry about replication lag.

If the ~10% churn in daily data that Spur cited is accurate, then we'd be looking at ~2 million INSERT/UPDATE/DELETE statements per run.

It would be lovely if we could get an estimate of how long this run will take.

@jijiki to clarify, what do you mean by diff, do you mean the https://exports.spur.us/v1/feeds/anonymous?diff=true (docs)? AIUI, that doesn't have the necessary data that we need.

Or by diff do you mean the process I sketched in T305724#8917809?

I mean the https://exports.spur.us/v1/feeds/anonymous?diff=true, used daily, plus the process in T305724#8917809 weekly. For the specific subset of the dataset that is missing from the daily diff, our data will be at most a week old. What do you think?

Unfortunately, the data missing from the export diff at this time is likely to be the data most relevant and interesting to the Foundation and Community.

In my proposal, the missing dataset will be there, but it will be refreshed weekly instead of daily. What I am trying to understand is whether we have information on how frequently we access data from this dataset, as well as how often and how much this dataset changes. For example:

  • If 1 out of 500 searches matches data found in that dataset, the impact is possibly minimal.
  • If the daily changes to that dataset are very few compared to its size, again, the impact is possibly minimal.

Lastly, do we have an estimate of the size of the dataset?

There is another option we can consider. We could run a daily cron to process the diff, and another daily cron to extract only the missing dataset from the full dump and INSERT/UPDATE/DELETE it using the process in T305724#8917809. The overall processing time for both combined should be considerably shorter.

Later on, we can consider having a monthly job that uses T305724#8917809 to do a full consolidation.

In my proposal, the missing dataset will be there, but it will be refreshed weekly instead of daily.

Sure, I'm just saying that this data is likely the most relevant to Foundation and Community interests, and thus expected (or hoped) to be the most frequently updated. I don't think it would be a dealbreaker if it can't be; just a reduced expectation.

What I am trying to understand is whether we have information on how frequently we access data from this dataset, as well as how often and how much this dataset changes. For example:

  • If 1 out of 500 searches matches data found in that dataset, the impact is possibly minimal.
  • If the daily changes to that dataset are very few compared to its size, again, the impact is possibly minimal.

I don't think we have any data or assumptions around frequency of access just yet, but I believe AHT had some numbers around likely daily diff volume.

Lastly, do we have an estimate of the size of the dataset?

Uncompressed, the entire daily feed has recently been around 8 GB.

My bad, I meant the size of the specific dataset that is not included in the daily diff. I was generally referring to the residential proxy dataset.

Edit: going to do some more investigation and present a more coherent summary, to avoid unnecessary confusion :)

If no one is opposed, it sounds like some work can move forward on implementing T305724#8917809, since it sounds like we agree that something like that has to happen; we're just not sure how frequently it should happen?


As for the diff, I did a small test to compare what the diff would give us vs checking against yesterday's file. I pulled down the following:

curl -L -O -H "Token: $TOKEN" "https://feeds.spur.us/v2/anonymous-residential/latest.json.gz"
curl -L -O -H "Token: $TOKEN" "https://feeds.spur.us/v2/anonymous-residential/20230619/feed.json.gz"
curl -H "Token: $TOKEN" "https://exports.spur.us/v1/feeds/anonymous?diff=true"

The diff file is a newline-delimited list of sub/add instructions, and I pulled one to compare against the feeds:

sub/add,ip,tag,infrastructure,services,clientCount,locationCity,locationState,locationCountry
sub,<IP>,OPEN_PROXY,,[HTTP],,Petrozavodsk,Karelia,RU

I grepped for that IP in the latest file and yesterday's file and found these (omitting IP, org, and as):

grep "<IP>" ./feed.json

{"ip":"<IP>","client":{},
"tunnels":[{"operator":"OPEN_PROXY","type":"PROXY","anonymous":true}],
"services":["HTTP"],
"location":{"city":"Petrozavodsk","state":"Karelia","country":"RU"},
"risks":["TUNNEL"]}
grep "<IP>" ./latest.json

{"ip":"<IP>","client":{},
"tunnels":[{"type":"VPN","anonymous":false}],
"services":["HTTP"],
"location":{"city":"Petrozavodsk","state":"Karelia","country":"RU"}}

It looks like the operations are more granular than we expected, so I'm not sure how useful this diff would be. In this instance, for example, we'd have to know to update both the tunnel and the risk states, and how to update them (the tunnel is now considered a VPN), whereas if we used just the feed files we'd do a wholesale update and let the feed decide the final state of the row.

I also looked at the number of entries across the files:

wc -l ./diff.txt 
   74133 ./diff.txt

wc -l ./feed.json
 26071716 ./feed.json

wc -l ./latest.json
 26205621 ./latest.json

It seems like there were 74k "diffs" (updates only?) and a +134k change in the entity count. I can't say how important updating the specific details of an IP is, since I don't know the heuristics used by functionaries/bots, but it would be nice to keep up with new additions as much as we can. Which is to say: it would be nice if we could check against the new feed every day.

Right, I also agree that the diff feed doesn't seem useful for our purposes.

If we can believe comm, then comm -12 <(sort $YESTERDAY) <(sort $TODAY) (which prints the lines common to two sorted files) generates ~15M rows. That means there are something like ~10M rows to insert, update, or delete. This seems far off the mark from the "10% churn" that someone from Spur cited, so maybe comm is wrong, or maybe "10% churn" refers only to inserts/deletes (e.g. ~2.5M), in which case we'd have ~7.5M records to update.
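For cross-checking, the counts comm produces can also be computed with plain set operations (a sketch; assumes one record per line, so a changed record for the same IP counts as one removal plus one addition, just as it does with comm):

```python
def churn_estimate(yesterday_lines, today_lines):
    """Count unchanged, removed, and added lines between two dumps,
    mirroring comm's three columns on sorted input."""
    old, new = set(yesterday_lines), set(today_lines)
    return {
        "unchanged": len(old & new),  # comm -12: lines common to both files
        "removed": len(old - new),    # comm -23: lines only in yesterday's file
        "added": len(new - old),      # comm -13: lines only in today's file
    }

counts = churn_estimate(["a", "b", "c"], ["b", "c", "d", "e"])
```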

If we're looking at ~10M update/delete/inserts each day, that seems difficult to manage.

Some things I've been thinking about:

  • doing the update process weekly instead of daily, using an "active" database and an "inactive" one: each week, drop and fully re-import the inactive database, then switch it to active
  • considering if a key/value store like Redis/Memcached (or even MySQL with a simple key/value table) would potentially handle this volume better, especially if the only product use case (that I know of right now) is looking up data for a single IP address

I posted some notes in T340516: Investigate diff expectations for updating imported data (T340516#8987218). tl;dr the proposal in T305724#8917809 is workable from DBA side. I'll update the task description.

kostajh claimed this task.

Closing this in favor of T341122: Implement daily data update routine. (Please re-open if you think we should discuss this further!)