At the moment, when we do a new run of the image-suggestions data pipeline, we:
- create a delta for the search indices
- push timestamped data to Cassandra
Consumers of image-suggestions data first query the search indices to get a list of articles with suggestions, and then query Cassandra via the HTTP gateway to get the suggestions for individual articles. Results from Cassandra are ordered by timestamp, so clients should use only the results with the latest timestamp. Any article with no suggestions should not be in the search index, so outdated suggestions for it shouldn't matter.
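As a rough illustration of the client-side filtering described above, here is a minimal sketch. It assumes the gateway response has already been parsed into a list of dicts, and the field names (`timestamp`, `image`) are hypothetical, not the actual response schema.

```python
from typing import Any, Dict, Iterable, List

Row = Dict[str, Any]


def latest_suggestions(rows: Iterable[Row]) -> List[Row]:
    """Keep only the rows that belong to the most recent pipeline run.

    Assumes each row carries a comparable "timestamp" value
    (hypothetical field name) identifying the run that produced it.
    """
    rows = list(rows)
    if not rows:
        return []
    newest = max(row["timestamp"] for row in rows)
    # Everything older than the newest timestamp is outdated and dropped.
    return [row for row in rows if row["timestamp"] == newest]


# Example: two runs are stored for the same article; only the newer one is used.
example_rows = [
    {"timestamp": 2, "image": "B.jpg"},
    {"timestamp": 2, "image": "C.jpg"},
    {"timestamp": 1, "image": "A.jpg"},
]
current = latest_suggestions(example_rows)
```

Every consumer has to carry some version of this logic, which is part of the extra client processing mentioned below.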
This all works, but architecturally it feels pretty hacky:
- we're storing one materialized view of the data in the search indices, and another in Cassandra
- we're preserving a bunch of outdated data in Cassandra
- the real source of truth is in Hive
Is there a better way?
My main source of unease is around managing data updates in Cassandra. We're doing it this way because we want updates to be atomic: if a run fails, we won't end up with a partial dataset. But that comes at the cost of extra processing in the client, extra storage, and a system that is harder to reason about.
Perhaps it's worth investigating generating and applying a delta, similar to how we do it for the search indices?
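To make the delta idea concrete, here is a minimal sketch of computing one between two runs. It assumes each run can be represented as a mapping from a row key to its payload. The key shape `(page_id, image)` and the `compute_delta` helper are hypothetical and not part of the existing pipeline.

```python
from typing import Any, Dict, Hashable, List, Tuple

Snapshot = Dict[Hashable, Any]


def compute_delta(previous: Snapshot, current: Snapshot) -> Tuple[Snapshot, List[Hashable]]:
    """Return (upserts, deletes) that turn `previous` into `current`.

    upserts: rows that are new or whose payload changed.
    deletes: keys present in the previous run but absent from the current one.
    """
    upserts = {
        key: value
        for key, value in current.items()
        if key not in previous or previous[key] != value
    }
    deletes = [key for key in previous if key not in current]
    return upserts, deletes


# Example with hypothetical (page_id, image) keys and confidence payloads.
previous_run = {(1, "A.jpg"): 0.8, (2, "B.jpg"): 0.5}
current_run = {(1, "A.jpg"): 0.9, (3, "C.jpg"): 0.7}
upserts, deletes = compute_delta(previous_run, current_run)
```

This only computes the delta. Applying it does not by itself give us the all-or-nothing behaviour we get today, so we would still need to decide how to handle a failure partway through.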