Currently, we import feed data with the expectation that some data drift will happen over time and that we'll need to drop/re-import at some point. This drift occurs because our database has no way of knowing whether its state matches the feeds it downloads from. It is a direct consequence of our decision to implement live data imports and updates on our database.
A few other solutions were proposed before we decided on this. Given what we know now and how finicky our current solution can be, let's revisit some of these alternative solutions and see if they're workable:
- Offline diff daily (current implementation)
- Full drop and import daily (switching over databases while one updates)
- Row-by-row comparison of data in table vs data file daily
If they're not workable, let's note down the reason here. We've had this discussion across multiple phab tasks and afaik we've never come to a clear consensus on why we couldn't move forward with some of these solutions.
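To make the row-by-row comparison option concrete, here is a minimal sketch in Python. It is illustrative only: the `ip` key, the `risk` column, and the `compare_rows` helper are all assumptions, not our actual schema or code, and it loads both sides into memory, which may not hold for the full feed.

```python
from typing import Dict, Iterable, List, Tuple

Row = Dict[str, str]


def compare_rows(
    table_rows: Iterable[Row],
    feed_rows: Iterable[Row],
    key: str = "ip",  # hypothetical primary key column
) -> Tuple[List[Row], List[Row], List[str]]:
    """Return (rows to insert, rows to update, keys to delete)."""
    # Index both sides by key so each row is compared exactly once.
    table = {row[key]: row for row in table_rows}
    feed = {row[key]: row for row in feed_rows}

    # In the feed but not in the table: new rows.
    to_insert = [feed[k] for k in sorted(feed.keys() - table.keys())]
    # In the table but not in the feed: stale rows.
    to_delete = sorted(table.keys() - feed.keys())
    # In both, but with different contents: drifted rows.
    to_update = [
        feed[k]
        for k in sorted(feed.keys() & table.keys())
        if feed[k] != table[k]
    ]
    return to_insert, to_update, to_delete
```

Applying all three result sets would bring the table back to parity with the data file, at the cost of reading every row on both sides each day.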
(A few errant notes)
Some pros:
- A swap would mean we would always have data parity
- Initial imports take ~3 hours and updates take ~2.5 hours, so a full re-import would cost only roughly 30 minutes more than an update
Some cons:
- We haven't solved the historical data problem (T351922: Allow querying historical information on IPs)
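For reference, the swap mentioned in the notes above could look roughly like the sketch below. It uses Python with an in-memory SQLite database purely for illustration, and swaps tables rather than whole databases. The table names (`actor_data`, `actor_data_new`, `actor_data_old`), the columns, and the `swap_in_fresh_import` helper are hypothetical. The connection is assumed to be opened with `isolation_level=None` so the transaction is controlled explicitly.

```python
import sqlite3
from typing import Iterable, Tuple


def swap_in_fresh_import(
    conn: sqlite3.Connection, rows: Iterable[Tuple[str, str]]
) -> None:
    """Import the feed into a side table, then swap it in for the live one."""
    # Build the fresh copy while the live table keeps serving reads.
    conn.execute("DROP TABLE IF EXISTS actor_data_new")
    conn.execute("CREATE TABLE actor_data_new (ip TEXT PRIMARY KEY, risk TEXT)")
    conn.executemany(
        "INSERT INTO actor_data_new (ip, risk) VALUES (?, ?)", rows
    )

    # Swap inside one transaction so readers never see a missing table.
    conn.execute("BEGIN")
    try:
        conn.execute("DROP TABLE IF EXISTS actor_data_old")
        conn.execute("ALTER TABLE actor_data RENAME TO actor_data_old")
        conn.execute("ALTER TABLE actor_data_new RENAME TO actor_data")
        conn.execute("COMMIT")
    except Exception:
        conn.execute("ROLLBACK")
        raise
```

Keeping the previous table around as `actor_data_old` gives a one-day rollback, but it does not address the historical data problem noted above.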