
Investigate: Automatic handling of errors and data drift in imports
Closed, Resolved · Public

Description

Currently, we import feed data with the expectation that some data drift will happen over time and that we'll need to drop/re-import at some point. This drift occurs because our database has no way of knowing that its state has parity with the feeds it's downloading from and is a direct consequence of our decision to implement live data imports and updates on our database.

A few other solutions were proposed before we decided on this. Given what we know now and how finicky our current solution can be, let's revisit some of these alternative solutions and see if they're workable:

  • Offline diff daily (current implementation)
  • Full drop and import daily (switching over databases while one updates)
  • Row-by-row comparison of data in table vs data file daily
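The second option (full drop and import with a database swap) can be sketched as: load the fresh feed into a staging table, then rename it over the live table so readers never see a half-imported state. This is a minimal sketch using sqlite3 as a stand-in; the table names are hypothetical, and production MariaDB would instead use a single atomic `RENAME TABLE live TO old, staging TO live` statement.

```python
import sqlite3

def swap_in_fresh_import(conn, live="feed_data", staging="feed_data_new"):
    """Sketch of a drop-and-swap import (hypothetical table names).

    The fresh feed is assumed to already be loaded into `staging`;
    the swap itself is cheap regardless of table size.
    """
    with conn:  # one transaction: commit all three steps or none
        conn.execute(f"ALTER TABLE {live} RENAME TO {live}_old")
        conn.execute(f"ALTER TABLE {staging} RENAME TO {live}")
        conn.execute(f"DROP TABLE {live}_old")

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE feed_data (ip TEXT)")
conn.execute("CREATE TABLE feed_data_new (ip TEXT)")
conn.execute("INSERT INTO feed_data_new VALUES ('1.2.3.4')")
swap_in_fresh_import(conn)
```

The trade-off is storage and import cost (two full copies during the ~3-hour import window), in exchange for guaranteed parity after every swap.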

If they're not workable, let's note the reasons here. We've had this discussion across multiple phab tasks and afaik we've never reached a clear consensus on why we couldn't move forward with some of these solutions.


(A few errant notes)

Some pros:

  • A swap would mean we would always have data parity
  • Initial imports take ~3 hours; updates take ~2.5 hours

Some cons:

Event Timeline

kostajh updated the task description.
STran claimed this task.

Had a meeting with Amir and Thalia and came to the following conclusions:

  • Could do row by row comparison
    • need to batch the SELECT queries but otherwise, millions of reads won't be a problem
  • Can't export database for offline diffing
    • would essentially require SELECT *, causing the same resource pressure we're trying to avoid
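The batching idea above amounts to keyset pagination: walk the table in primary-key order with small SELECTs rather than one giant SELECT *, comparing each batch against the feed file. A minimal sketch, again with sqlite3 as a stand-in and hypothetical table/column names (`feed_data`, `ip`, `risk`):

```python
import sqlite3

def find_drift(conn, feed_rows, batch_size=1000):
    """Sketch of batched row-by-row comparison (hypothetical schema).

    `feed_rows` maps primary key -> expected value parsed from the feed
    file. Keyset pagination (`WHERE ip > ?`) keeps each read small, so
    millions of rows become many cheap queries instead of one huge one.
    """
    drifted = []
    last_key = ""
    while True:
        batch = conn.execute(
            "SELECT ip, risk FROM feed_data WHERE ip > ? ORDER BY ip LIMIT ?",
            (last_key, batch_size),
        ).fetchall()
        if not batch:
            break
        for ip, risk in batch:
            if feed_rows.get(ip) != risk:
                drifted.append(ip)  # row drifted from the feed
        last_key = batch[-1][0]  # resume after the last key seen
    return drifted
```

Rows present in the feed but missing from the table would need a second pass over `feed_rows`; this sketch only covers drift in rows the table already has.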

tl;dr: At the moment, what we have isn't perfect, but it's still the best way forward we have. When the product is more mature, we may want to revisit row-by-row updates.

Adding more notes from the meeting notes doc which may be inaccessible to some reading this:

  • Need smaller transactions, or they cause delays
  • Can't fit that many writes per day into one database
  • Problem in general with Spur: a lot of data is being written (MariaDB isn't designed to handle this volume) - IOPS pressure
  • Expiry: can add an expiry column, but don't index it. Iterate over all rows (in batches, w/an index that already exists?) and check for expiry.
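The expiry note above can be sketched as follows: paginate over the table by its primary key (which is already indexed) and check the unindexed `expires` column in application code, deleting expired rows batch by batch. Names and schema are hypothetical; sqlite3 stands in for MariaDB.

```python
import sqlite3

def purge_expired(conn, now, batch_size=1000):
    """Sketch of the unindexed-expiry idea (hypothetical schema).

    The primary key drives pagination, so no index on `expires` is
    needed; each batch is a small read plus a few targeted deletes,
    which avoids the IOPS pressure of a full-table scan or a big
    ranged DELETE.
    """
    last_key = ""
    purged = 0
    while True:
        batch = conn.execute(
            "SELECT ip, expires FROM feed_data WHERE ip > ? ORDER BY ip LIMIT ?",
            (last_key, batch_size),
        ).fetchall()
        if not batch:
            break
        expired = [ip for ip, expires in batch if expires <= now]
        if expired:
            conn.executemany(
                "DELETE FROM feed_data WHERE ip = ?",
                [(ip,) for ip in expired],
            )
            purged += len(expired)
        last_key = batch[-1][0]  # keyset pagination survives the deletes
    conn.commit()
    return purged
```

Because pagination is by `ip > last_key`, deleting rows mid-scan doesn't disturb the iteration, and each batch is its own small unit of work rather than one long transaction.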