Consider whether we can manage data for image-suggestions better
Open, Needs Triage, Public

Description

ATM when we do a new run of the image-suggestions data pipeline we:

  • create a delta for the search indices
  • push timestamped data to cassandra

Consumers of image-suggestions data first query the search indices to get a list of articles with suggestions, and then query Cassandra via the HTTP gateway to get suggestions for individual articles. Results from Cassandra are ordered by timestamp, so clients should use only the results with the latest timestamp (any article with no suggestions should not appear in the search index, so outdated suggestions shouldn't matter).
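
For illustration, here's a minimal sketch of that read path in Python. The gateway URL, the response shape, and the field names are hypothetical placeholders, not the real API:

```
# Hypothetical sketch of the consumer read path: fetch a page's suggestion rows
# from the gateway, then keep only those from the newest dataset. The endpoint
# and the "rows"/"dataset_id" fields are placeholders, not the actual API.
import requests

GATEWAY = "https://example.org/image-suggestions/v0"  # placeholder endpoint

def latest_suggestions(wiki: str, page_id: int) -> list:
    """Return only the suggestions carrying the most recent dataset timestamp."""
    rows = requests.get(f"{GATEWAY}/{wiki}/{page_id}", timeout=10).json()["rows"]
    if not rows:
        return []
    newest = max(row["dataset_id"] for row in rows)  # IDs are timestamped, so max == latest
    return [row for row in rows if row["dataset_id"] == newest]
```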

This all works, but architecturally it feels pretty hacky:

  • we're storing one materialized view of the data in the search indices, and another in cassandra
  • we're preserving a bunch of outdated data in cassandra
  • the real Source of Truth is in Hive

Is there a better way?

My main source of unease is around managing data updates in Cassandra. We're doing it this way because we expect updates to be atomic - if there's a failure we won't end up with a partial dataset - but at the cost of extra processing in the client, extra storage, and making the whole system more difficult to reason about.

Perhaps it's worth investigating generating and applying a delta, similar to how we do it for the search indices?
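
As a rough sketch of what generating such a delta could look like in PySpark (the table names, columns, and (wiki, page_id) key are assumptions made for illustration; this is not what HiveToCassandra.py does today):

```
# Rough sketch: compute a delta between the previous and new Hive snapshots.
# Table names, columns and the (wiki, page_id) key are illustrative assumptions.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

prev = spark.table("image_suggestions.prev_snapshot")  # hypothetical tables
new = spark.table("image_suggestions.new_snapshot")

key = ["wiki", "page_id"]

# Pages that gained suggestions: present only in the new snapshot.
inserts = new.join(prev.select(*key), key, "left_anti")

# Pages that lost all their suggestions: present only in the old snapshot,
# so they would need DELETEs pushed to Cassandra.
deletes = prev.select(*key).join(new.select(*key), key, "left_anti")

# Pages present in both snapshots whose suggestions changed (suggestions_hash
# is a hypothetical precomputed digest column used to keep the comparison cheap).
updates = (new.alias("n")
           .join(prev.alias("p"), key)
           .where("n.suggestions_hash <> p.suggestions_hash")
           .select("n.*"))
```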

Event Timeline

> My main source of unease is around managing data updates in Cassandra. We're doing it this way because we expect updates to be atomic - if there's a failure we won't end up with a partial dataset - but at the cost of extra processing in the client, extra storage, and making the whole system more difficult to reason about.

Other approaches we could take are:

  • Investigate mode='overwrite' instead of mode='append' when calling saveToCassandra() in HiveToCassandra.py (see the sketch after this list)
    • Would updating in this way be atomic? If not what happens in the case of failure part-way through?
    • What happens to the data behind the scenes? Does it all get deleted first, then re-added? If so does that mean that there will be an interval during which no data will be available to a client querying via the api gateway?
    • Are there failure modes we need to consider?
  • Investigate generating and applying a delta, similar to how we do it for the search indices
    • ATM HiveToCassandra.py only appends data - it can't generate updates or deletes - so either that would need to be modified or we'd need to use another way to push the data
    • Are there failure modes we need to consider?
  • We're considering using a relational DB for storing section-image-suggestions data, so perhaps we ought to consider a relational model here too?
    • Note that in this case someone would need to write an HTTP API for accessing the data in the DB
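
On the first point, here is a hedged sketch of what an overwrite-style load could look like through the connector's DataFrame writer (an assumption - HiveToCassandra.py currently goes through the RDD saveToCassandra() append path, and the keyspace/table names below are placeholders). Worth noting: in the DataStax connector, overwrite mode requires confirm.truncate=true and truncates the table before re-writing it, so it is not atomic and there would be an interval during which readers see no data.

```
# Hedged sketch of an overwrite-style load via the Spark Cassandra connector's
# DataFrame writer (an assumption; not the current saveToCassandra() code).
# Keyspace/table names are placeholders. In the DataStax connector, overwrite
# truncates the target table first, then re-writes it.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
suggestions = spark.table("image_suggestions.new_snapshot")  # hypothetical source

(suggestions.write
    .format("org.apache.spark.sql.cassandra")
    .options(keyspace="image_suggestions", table="suggestions")  # placeholders
    .option("confirm.truncate", "true")  # the connector refuses overwrite without this
    .mode("overwrite")
    .save())
```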

To provide some perspective about why it is the way it is (for posterity's sake and/or for those not present when we designed the model):

The way this has been conceived, each new import needs to a) add suggestions for pages that had none prior, b) wholesale replace any suggestions for pages that did have prior results, and c) remove any previous suggestions that are no longer valid.

Assuming that correctness is important (I think it must be), then (b) is something that should be both atomic and isolated. Otherwise results could be wrong (not just stale) after unexpected import errors, or unfortunate query timing.

And (c) presents challenges as well. This is basically a set difference, and requires that we obtain a canonical list of keys for the current set for purposes of comparison with the new one. Querying the entire dataset may be possible, but whether or not it's practical/advisable will depend on the size of the dataset, and on how often we're relying on such techniques; this is too heavy-handed, in my opinion, to be a practice we adopt for a platform.

The way it is currently modeled is basically MVCC. Each new data import is appended and given a unique ID. These IDs are totally ordered (they have a temporal component), and results are returned from the database in descending order. This guarantees that (b) above is atomic and isolated. If TTLs are utilized, then garbage-collecting legacy versions is basically free, though that comes with the caveat that it relies upon the timing of imports (which is brittle). Use of batch operations that bundle a range DELETE with the INSERT, combined with TTLs as a fallback to address (c), seems reasonably elegant (given the requirements we're working with), even if it's not compatible with HiveToCassandra.
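
For concreteness, a minimal sketch of that batch idea using the Python cassandra-driver (the keyspace, table, and column names are assumptions, not the real schema):

```
# Minimal sketch of "range DELETE bundled with the INSERT, plus TTL as a
# fallback", using the Python cassandra-driver. Keyspace, table and column
# names are assumptions, not the real schema.
from cassandra.cluster import Cluster
from cassandra.query import BatchStatement

TTL_SECONDS = 30 * 24 * 3600  # fallback garbage collection for legacy versions

session = Cluster(["localhost"]).connect("image_suggestions")  # placeholder contact point

delete_older = session.prepare(
    # Range delete on the clustering column: drop every dataset version older
    # than the one being written, within the same partition.
    "DELETE FROM suggestions WHERE wiki = ? AND page_id = ? AND dataset_id < ?"
)
insert_new = session.prepare(
    "INSERT INTO suggestions (wiki, page_id, dataset_id, suggestion) "
    f"VALUES (?, ?, ?, ?) USING TTL {TTL_SECONDS}"
)

def replace_suggestions(wiki, page_id, dataset_id, suggestions):
    """Single-partition batch: drop older versions and write the new one together."""
    batch = BatchStatement()
    batch.add(delete_older, (wiki, page_id, dataset_id))
    for suggestion in suggestions:
        batch.add(insert_new, (wiki, page_id, dataset_id, suggestion))
    session.execute(batch)
```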