
Ensure database is up-to-date with latest data file
Open, Needs Triage, Public

Description

Background

Spur provides data files with tens of millions of lines daily.

We're currently diffing yesterday's and today's data files each day. This is fast, since most entries do not change from one day to the next, and identical lines in flat files are cheap to find. (Currently the whole update takes 2-3 hours, most of which is spent updating the database, not calculating the diff.)
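For illustration, the diff is conceptually just a set comparison of lines. A minimal sketch (file handling is simplified, and holding both files in memory like this is the obvious cost at tens of millions of lines):

'use strict';

const fs = require( 'fs' );
const readline = require( 'readline' );

// Read every line of a data file into a Set so that identical lines
// can be found with a hash lookup.
async function readLines( path ) {
	const lines = new Set();
	const rl = readline.createInterface( {
		input: fs.createReadStream( path ),
		crlfDelay: Infinity
	} );
	for await ( const line of rl ) {
		lines.add( line );
	}
	return lines;
}

// Lines only in today's file become inserts/updates; lines only in
// yesterday's file become deletes/updates.
async function diffFiles( yesterdayPath, todayPath ) {
	const yesterday = await readLines( yesterdayPath );
	const today = await readLines( todayPath );
	return {
		added: [ ...today ].filter( ( line ) => !yesterday.has( line ) ),
		removed: [ ...yesterday ].filter( ( line ) => !today.has( line ) )
	};
}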

However, we update the DB assuming that it perfectly matches yesterday's data. If that assumption is wrong, we can get drift over time (e.g. outdated rows that never get deleted). Drift would arise if errors during an update left it incomplete. To try to avoid this, we disallow further updates until the problem is fixed, but that means the service stays stuck on one day's data until we intervene.

We have discussed diffing the latest data file against the current database instead, but although this approach would be allowed (we are allowed to SELECT every row of the database if we do it in batches - see T353440#9422354), it is likely to be much slower. This is because we would need to SELECT and transform each row for comparison in order to discover whether lines are identical.

(Note that we can't reimport all the data each day, or download the database for comparison: T353440#9422354.)

Possible solution

We could instead write a maintenance script that SELECTs each row of the DB, compares it to today's file, and makes fixes where it finds discrepancies. If we ran it right after an update, we would usually expect to find nothing to correct.

As long as the script could complete well within the time between the update finishing and the next update starting, it wouldn't matter if it was slow. (If we happened to find that it runs a lot quicker than expected, we could even replace the daily diff with this.)
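To make the shape of such a script concrete, here is a rough sketch of the scan loop using mysql2/promise and keyset pagination. The connection details, the actor table, the ip column and the comparison helper are illustrative assumptions, not ipoid's actual client or schema:

'use strict';

const mysql = require( 'mysql2/promise' );

// Placeholder comparison; the real script would transform the DB row
// and the file entry into a canonical form before comparing.
function rowMatchesEntry( row, entry ) {
	return JSON.stringify( row ) === JSON.stringify( entry );
}

// Scan the whole table in batches and compare each row against the
// entries parsed from today's file (todayByIp: Map of ip => entry).
async function scanDatabase( todayByIp, batchSize = 10000 ) {
	const conn = await mysql.createConnection( { host: 'localhost', database: 'ipoid' } );
	const discrepancies = [];
	let lastIp = '';
	while ( true ) {
		// Keyset pagination keeps each SELECT cheap on a large table.
		const [ rows ] = await conn.query(
			'SELECT * FROM actor WHERE ip > ? ORDER BY ip LIMIT ?',
			[ lastIp, batchSize ]
		);
		if ( rows.length === 0 ) {
			break;
		}
		for ( const row of rows ) {
			const entry = todayByIp.get( row.ip );
			if ( !entry ) {
				discrepancies.push( { type: 'stale-row', ip: row.ip } );
			} else if ( !rowMatchesEntry( row, entry ) ) {
				discrepancies.push( { type: 'mismatch', ip: row.ip } );
			}
		}
		lastIp = rows[ rows.length - 1 ].ip;
	}
	await conn.end();
	return discrepancies;
}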

How often would we need to run this?

  • Drift should only happen after errors in a daily update, so we could just run this whenever an update fails and isn't fixed by T344941: Respond to data update and import errors
  • In case of something we haven't accounted for, e.g. some kind of silent errors, we could run this periodically (e.g. once a week/month)
How urgent is this?

On the one hand, not terribly urgent, since ipoid doesn't seem to be encountering update errors frequently (it hasn't encountered any since going live about a month ago).

On the other hand, the service becoming stuck on one day until we intervene seems quite bad.

Event Timeline

> We could instead write a maintenance script that SELECTs each row of the DB, compares it to today's file, and makes fixes where it finds discrepancies. If we ran it right after an update, we would usually expect to find nothing to correct.

Should we also handle the reverse case, where entries from the latest source file didn't make it into the DB? The maintenance script would need to iterate over each line in the file, and do a SELECT query to check that it exists in the DB.
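Roughly, that check could look like this (mysql2/promise, the actor table and the ip field are illustrative assumptions, not the real schema):

'use strict';

const fs = require( 'fs' );
const readline = require( 'readline' );
const mysql = require( 'mysql2/promise' );

// Walk today's file line by line and record any IP that has no row
// in the database.
async function findMissingRows( todayPath ) {
	const conn = await mysql.createConnection( { host: 'localhost', database: 'ipoid' } );
	const missing = [];
	const rl = readline.createInterface( {
		input: fs.createReadStream( todayPath ),
		crlfDelay: Infinity
	} );
	for await ( const line of rl ) {
		const entry = JSON.parse( line );
		// One existence check per line; could be batched if this turns
		// out to be too slow.
		const [ rows ] = await conn.query(
			'SELECT 1 FROM actor WHERE ip = ? LIMIT 1',
			[ entry.ip ]
		);
		if ( rows.length === 0 ) {
			missing.push( entry.ip );
		}
	}
	await conn.end();
	return missing;
}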

> As long as the script could complete well within the time between the update finishing and the next update starting, it wouldn't matter if it was slow.

Maybe we could just run it as a matter of course, directly after the daily import, as part of the same cronjob. That has the benefit that Kubernetes knows it should not schedule another concurrent instance of that job.


Overall, I like this approach and see it as a worthwhile investment. The alternative is that we discover a problem and have to drop the DB and do a full reimport, which would take several hours of computer time and a few hours of people time to coordinate, communicate, etc.

> We could instead write a maintenance script that SELECTs each row of the DB, compares it to today's file, and makes fixes where it finds discrepancies. If we ran it right after an update, we would usually expect to find nothing to correct.

> Should we also handle the reverse case, where entries from the latest source file didn't make it into the DB? The maintenance script would need to iterate over each line in the file, and do a SELECT query to check that it exists in the DB.

Yes - good point.

Just jotting down some thoughts about how we could do this.

Neat-looking approach

We could add a new mode to output-sql.js handlers along these lines:

fixDatabase: function ( line ) {
	const data = JSON.parse( line );
	// lookupActor() and actorDataMatches() are placeholders for real helpers.
	const existing = lookupActor( data.ip );
	if ( !existing ) {
		// IP not in DB: insert it
		return generateInsertActorQueries( data );
	}
	if ( !actorDataMatches( existing, data ) ) {
		// Data differs from the DB: update the row with the new data
		return generateUpdateActorQueries( data );
	}
	// Data unchanged: just bump last_updated
	return generateUpdateActorQueries( data );
}

Caveat

This all looks nice and neat, but it does mean making 30-odd million updates every time we run this, which is something we were advised not to do: T353440#9518844

Alternatives

The problematic part is updating last_updated: it means we now write to a row for every unchanged IP, but that write is also how we're able to find removed IPs (any row whose last_updated was not bumped must no longer be in the file).

  • Solution 1: We run fixDatabase infrequently (as long as the DBAs say it's OK to do all these updates in one day, once in a while). I think this wouldn't be too bad, since we don't actually expect the database to break anyway. It's currently broken because of T356736, which was caused by a flawed fundamental assumption, and that feels like a rare kind of event to me.
  • Solution 2: Instead of updating last_updated, we find removed IPs by SELECTing all the IPs in the database (in manageable batches) then searching the data file for them. How do we search the file though?

I believe we have enough RAM available (either now, or if we request it) that we could iterate over each line of the file and store the IPs (and maybe also the associated data) in memory. Alternatively, I was wondering if we could iterate over each line in the file and write the contents to a SQLite file; that would use a lot less RAM, and I believe lookups would be much faster. But I'd lean towards exploring the "load everything into memory" option first.
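A minimal sketch of the load-everything-into-memory option (the ip field name and the Map shape are assumptions):

'use strict';

const fs = require( 'fs' );
const readline = require( 'readline' );

// Build a Map of ip => raw line from today's file. Memory use is the
// main unknown at tens of millions of entries.
async function loadFileIntoMemory( todayPath ) {
	const byIp = new Map();
	const rl = readline.createInterface( {
		input: fs.createReadStream( todayPath ),
		crlfDelay: Infinity
	} );
	for await ( const line of rl ) {
		byIp.set( JSON.parse( line ).ip, line );
	}
	return byIp;
}

// A removed IP is then simply one that a batched SELECT returns but the
// Map does not contain, so no last_updated write is needed.
function findRemovedIps( dbIpBatch, byIp ) {
	return dbIpBatch.filter( ( ip ) => !byIp.has( ip ) );
}

If memory turns out to be the blocker, the same Map-style lookup could be backed by the SQLite file instead.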