Background
Spur provides data files with tens of millions of lines daily.
We're currently diffing yesterday's and today's data files each day. This is fast, since most entries do not change from one day to the next, and identical lines in flat files are cheap to find. (Currently the whole update takes 2-3 hours, most of which is spent updating the database, not calculating the diff.)
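For illustration, a diff of two sorted flat files can be computed with a streaming merge, which keeps memory use constant however many millions of lines the files have (a sketch only; `diff_sorted` is a hypothetical name, not ipoid's actual implementation):

```python
def diff_sorted(old_lines, new_lines):
    """Yield ('del', line) / ('add', line) for entries that differ
    between two sorted iterables of lines; identical lines are skipped."""
    old_iter, new_iter = iter(old_lines), iter(new_lines)
    old = next(old_iter, None)
    new = next(new_iter, None)
    while old is not None or new is not None:
        if new is None or (old is not None and old < new):
            yield ('del', old)          # present yesterday, gone today
            old = next(old_iter, None)
        elif old is None or new < old:
            yield ('add', new)          # new today
            new = next(new_iter, None)
        else:                           # identical line: no change
            old = next(old_iter, None)
            new = next(new_iter, None)
```

For example, `list(diff_sorted(['a', 'b', 'c'], ['b', 'c', 'd']))` returns `[('del', 'a'), ('add', 'd')]`. Because most lines are identical day to day, the output stays small even for very large inputs.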
However, we update the DB assuming that it perfectly matches yesterday's data. If that assumption is wrong, we can get drift over time (e.g. outdated rows that never get deleted). This could happen if errors during an update left it incomplete. To guard against this, we disallow further updates until the problem is fixed, but then the service stays stuck on one day's data until we intervene.
We have discussed diffing the latest data file against the current database instead. This approach would be permitted (we are allowed to SELECT every row of the database if we do it in batches - see T353440#9422354), but it is likely to be much slower, because we would need to SELECT and transform every row just to discover whether its line is identical.
(Note that we can't reimport all the data each day, or download the database for comparison: T353440#9422354.)
Possible solution
We could instead write a maintenance script that SELECTs each row of the DB, compares it to today's file, and fixes any discrepancies it finds. If we run it after a successful update, we would usually expect to find nothing to correct.
As long as the script could complete well within the time between one update finishing and the next starting, it wouldn't matter if it were slow. (If it turned out to run much quicker than expected, we could even replace the daily diff with it.)
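The reconciliation pass could be structured as below. This is a sketch under assumptions: `fetch_batch` stands in for a batched SELECT in primary-key order, `to_line` for the row-to-file-line transform, and `apply_fix` for whatever correction mechanism the script uses; none of these are ipoid's real names or schema.

```python
def reconcile(fetch_batch, to_line, file_lines, apply_fix):
    """Compare every DB row against today's file and fix discrepancies.

    fetch_batch(after_key) -> next batch of rows (hypothetical batched SELECT);
    to_line(row)           -> the row serialised in the data file's format;
    file_lines             -> set of today's file lines;
    apply_fix(action, x)   -> records/applies one correction.
    """
    seen = set()
    after_key = None
    while True:
        batch = fetch_batch(after_key)
        if not batch:
            break
        for row in batch:
            line = to_line(row)
            if line in file_lines:
                seen.add(line)          # row matches the file: nothing to do
            else:
                apply_fix('delete_or_update', row)  # stale/drifted row
        after_key = batch[-1]['id']     # keyset pagination for the next SELECT
    for line in file_lines - seen:
        apply_fix('insert', line)       # file entry missing from the DB
```

After a clean update, both fix branches should go unvisited, so the cost is one batched scan of the table plus holding the file's lines (or hashes of them) in memory.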
How often would we need to run this?
- Drift should only happen after errors in a daily update, so we could just run this whenever an update fails and isn't fixed by T344941: Respond to data update and import errors
- In case of something we haven't accounted for, e.g. some kind of silent errors, we could run this periodically (e.g. once a week/month)
How urgent is this?
On the one hand, not terribly urgent, since ipoid doesn't seem to encounter update errors frequently (it hasn't encountered any since going live about a month ago).
On the other hand, the service becoming stuck on one day until we intervene seems quite bad.