
Investigate: Automatic handling of errors and data drift in imports
Closed, Resolved · Public

Description

Currently, we import feed data with the expectation that some data drift will happen over time and that we'll need to drop/re-import at some point. This drift occurs because our database has no way of knowing that its state has parity with the feeds it's downloading from and is a direct consequence of our decision to implement live data imports and updates on our database.

A few other solutions were proposed before we decided on this. Given what we know now and how finicky our current solution can be, let's revisit some of these alternative solutions and see if they're workable:

  • Offline diff daily (current implementation)
  • Full drop and import daily (switching over databases while one updates)
  • Row-by-row comparison of data in table vs data file daily
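The second option (full drop and import with a database swap) can be sketched as: load the fresh feed into a staging table, then rename it over the live table so readers never see a half-imported state. This is a minimal sketch using sqlite3 as a stand-in; the table names are hypothetical, and production MariaDB would instead use a single atomic `RENAME TABLE live TO old, staging TO live` statement.

```python
import sqlite3

def swap_in_fresh_import(conn, live="feed_data", staging="feed_data_new"):
    """Sketch of a drop-and-swap import (hypothetical table names).

    The fresh feed is assumed to already be loaded into `staging`;
    the swap itself is cheap regardless of table size.
    """
    with conn:  # one transaction: commit all three steps or none
        conn.execute(f"ALTER TABLE {live} RENAME TO {live}_old")
        conn.execute(f"ALTER TABLE {staging} RENAME TO {live}")
        conn.execute(f"DROP TABLE {live}_old")

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE feed_data (ip TEXT)")
conn.execute("CREATE TABLE feed_data_new (ip TEXT)")
conn.execute("INSERT INTO feed_data_new VALUES ('1.2.3.4')")
swap_in_fresh_import(conn)
```

The trade-off is storage and import cost (two full copies during the ~3-hour import window), in exchange for guaranteed parity after every swap.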

If they're not workable, let's note the reasons here. We've had this discussion across multiple phab tasks and afaik we've never reached a clear consensus on why we couldn't move forward with some of these solutions.


(A few errant notes)

Some pros:

  • A swap would mean we would always have data parity
  • Initial imports take ~3 hours; updates take ~2.5 hours

Some cons:

Event Timeline

kostajh updated the task description.
STran claimed this task.

Had a meeting with Amir and Thalia and came to the following conclusions:

  • Could do row by row comparison
    • need to batch the SELECT queries but otherwise, millions of reads won't be a problem
  • Can't export database for offline diffing
    • would essentially require SELECT *, causing the same resource pressure we're trying to avoid
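The batching idea above amounts to keyset pagination: walk the table in primary-key order with small SELECTs rather than one giant SELECT *, comparing each batch against the feed file. A minimal sketch, again with sqlite3 as a stand-in and hypothetical table/column names (`feed_data`, `ip`, `risk`):

```python
import sqlite3

def find_drift(conn, feed_rows, batch_size=1000):
    """Sketch of batched row-by-row comparison (hypothetical schema).

    `feed_rows` maps primary key -> expected value parsed from the feed
    file. Keyset pagination (`WHERE ip > ?`) keeps each read small, so
    millions of rows become many cheap queries instead of one huge one.
    """
    drifted = []
    last_key = ""
    while True:
        batch = conn.execute(
            "SELECT ip, risk FROM feed_data WHERE ip > ? ORDER BY ip LIMIT ?",
            (last_key, batch_size),
        ).fetchall()
        if not batch:
            break
        for ip, risk in batch:
            if feed_rows.get(ip) != risk:
                drifted.append(ip)  # row drifted from the feed
        last_key = batch[-1][0]  # resume after the last key seen
    return drifted
```

Rows present in the feed but missing from the table would need a second pass over `feed_rows`; this sketch only covers drift in rows the table already has.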

tl;dr: At the moment, what we have isn't perfect, but it's still the best way forward we have. When the product is more mature, we may want to revisit row-by-row updates.

Adding more notes from the meeting notes doc which may be inaccessible to some reading this:

  • Need smaller transactions, or they cause delays
  • Can't fit that many writes per day into one database
  • Problem in general with Spur: a lot of data is being written (MariaDB isn't designed to handle this volume) - IOPS pressure
  • Expiry: can add an expiry column, but don't index it. Iterate over all rows (in batches, w/an index that already exists?) and check for expiry.
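The expiry note above can be sketched as follows: paginate over the table by its primary key (which is already indexed) and check the unindexed `expires` column in application code, deleting expired rows batch by batch. Names and schema are hypothetical; sqlite3 stands in for MariaDB.

```python
import sqlite3

def purge_expired(conn, now, batch_size=1000):
    """Sketch of the unindexed-expiry idea (hypothetical schema).

    The primary key drives pagination, so no index on `expires` is
    needed; each batch is a small read plus a few targeted deletes,
    which avoids the IOPS pressure of a full-table scan or a big
    ranged DELETE.
    """
    last_key = ""
    purged = 0
    while True:
        batch = conn.execute(
            "SELECT ip, expires FROM feed_data WHERE ip > ? ORDER BY ip LIMIT ?",
            (last_key, batch_size),
        ).fetchall()
        if not batch:
            break
        expired = [ip for ip, expires in batch if expires <= now]
        if expired:
            conn.executemany(
                "DELETE FROM feed_data WHERE ip = ?",
                [(ip,) for ip in expired],
            )
            purged += len(expired)
        last_key = batch[-1][0]  # keyset pagination survives the deletes
    conn.commit()
    return purged
```

Because pagination is by `ip > last_key`, deleting rows mid-scan doesn't disturb the iteration, and each batch is its own small unit of work rather than one long transaction.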