Consider whether we can manage data for image-suggestions better
Open, Needs Triage, Public

Description

ATM when we do a new run of the image-suggestions data pipeline we:

  • create a delta for the search indices
  • push timestamped data to cassandra

Consumers of image-suggestions data first query the search indices to get a list of articles with suggestions, and then query Cassandra via the HTTP gateway to get suggestions for individual articles. Results from Cassandra are ordered by timestamp, so clients should use only the results with the latest timestamp (any article with no suggestions should not appear in the search index, so outdated suggestions shouldn't matter).
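
For illustration, here's a minimal sketch of that read path in Python. The gateway URL, the response shape, and the field names are hypothetical placeholders, not the real API:

```
# Hypothetical sketch of the consumer read path: fetch a page's suggestion rows
# from the gateway, then keep only those from the newest dataset. The endpoint
# and the "rows"/"dataset_id" fields are placeholders, not the actual API.
import requests

GATEWAY = "https://example.org/image-suggestions/v0"  # placeholder endpoint

def latest_suggestions(wiki: str, page_id: int) -> list:
    """Return only the suggestions carrying the most recent dataset timestamp."""
    rows = requests.get(f"{GATEWAY}/{wiki}/{page_id}", timeout=10).json()["rows"]
    if not rows:
        return []
    newest = max(row["dataset_id"] for row in rows)  # IDs are timestamped, so max == latest
    return [row for row in rows if row["dataset_id"] == newest]
```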

This all works, but architecturally it feels pretty hacky:

  • we're storing one materialized view of the data in the search indices, and another in cassandra
  • we're preserving a bunch of outdated data in cassandra
  • the real Source of Truth is in Hive

Is there a better way?

My main source of unease is around managing data updates in Cassandra. We're doing it this way because we expect updates to be atomic - if there's a failure we won't end up with a partial dataset - but at the cost of extra processing in the client, extra storage, and making the whole system more difficult to reason about.

Perhaps it's worth investigating generating and applying a delta, similar to how we do it for the search indices?
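
As a rough sketch of what generating such a delta could look like in PySpark (the table names, columns, and (wiki, page_id) key are assumptions made for illustration; this is not what HiveToCassandra.py does today):

```
# Rough sketch: compute a delta between the previous and new Hive snapshots.
# Table names, columns and the (wiki, page_id) key are illustrative assumptions.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

prev = spark.table("image_suggestions.prev_snapshot")  # hypothetical tables
new = spark.table("image_suggestions.new_snapshot")

key = ["wiki", "page_id"]

# Pages that gained suggestions: present only in the new snapshot.
inserts = new.join(prev.select(*key), key, "left_anti")

# Pages that lost all their suggestions: present only in the old snapshot,
# so they would need DELETEs pushed to Cassandra.
deletes = prev.select(*key).join(new.select(*key), key, "left_anti")

# Pages present in both snapshots whose suggestions changed (suggestions_hash
# is a hypothetical precomputed digest column used to keep the comparison cheap).
updates = (new.alias("n")
           .join(prev.alias("p"), key)
           .where("n.suggestions_hash <> p.suggestions_hash")
           .select("n.*"))
```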

Event Timeline

> My main source of unease is around managing data updates in Cassandra. We're doing it this way because we expect updates to be atomic - if there's a failure we won't end up with a partial dataset - but at the cost of extra processing in the client, extra storage, and making the whole system more difficult to reason about.

Other approaches we could take are:

  • Investigate mode='overwrite' instead of mode='append' when calling saveToCassandra() in HiveToCassandra.py (see the sketch after this list)
    • Would updating in this way be atomic? If not what happens in the case of failure part-way through?
    • What happens to the data behind the scenes? Does it all get deleted first, then re-added? If so does that mean that there will be an interval during which no data will be available to a client querying via the api gateway?
    • Are there failure modes we need to consider?
  • Investigate generating and applying a delta, similar to how we do it for the search indices
    • ATM HiveToCassandra.py only appends data - it can't generate updates or deletes - so either that would need to be modified or we'd need to use another way to push the data
    • Are there failure modes we need to consider?
  • We're considering using a relational DB for storing section-image-suggestions data, so perhaps we ought to consider a relational model here too?
    • Note that in this case someone would need to write an HTTP API for accessing the data in the DB
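
On the first point, here is a hedged sketch of what an overwrite-style load could look like through the connector's DataFrame writer (an assumption - HiveToCassandra.py currently goes through the RDD saveToCassandra() append path, and the keyspace/table names below are placeholders). Worth noting: in the DataStax connector, overwrite mode requires confirm.truncate=true and truncates the table before re-writing it, so it is not atomic and there would be an interval during which readers see no data.

```
# Hedged sketch of an overwrite-style load via the Spark Cassandra connector's
# DataFrame writer (an assumption; not the current saveToCassandra() code).
# Keyspace/table names are placeholders. In the DataStax connector, overwrite
# truncates the target table first, then re-writes it.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
suggestions = spark.table("image_suggestions.new_snapshot")  # hypothetical source

(suggestions.write
    .format("org.apache.spark.sql.cassandra")
    .options(keyspace="image_suggestions", table="suggestions")  # placeholders
    .option("confirm.truncate", "true")  # the connector refuses overwrite without this
    .mode("overwrite")
    .save())
```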

To provide some perspective about why it is the way it is (for posterity's sake and/or for those not present when we designed the model):

The way this has been conceived, each new import needs to a) add suggestions for pages that had none prior, b) wholesale replace any suggestions for pages that did have prior results, and c) remove any previous suggestions that are no longer valid.

Assuming that correctness is important (I think it must be), then (b) is something that should be both atomic and isolated. Otherwise results could be wrong (not just stale) after unexpected import errors, or unfortunate query timing.

And (c) presents challenges as well. This is basically a set difference, and requires that we obtain a canonical list of keys for the current set for purposes of comparison with the new one. Querying the entire dataset may be possible, but whether or not it's practical/advisable will depend on the size of the dataset, and on how often we're relying on such techniques; this is too heavy-handed, in my opinion, to be a practice we adopt for a platform.

The way it is currently modeled is basically MVCC. Each new data import is appended and given a unique ID. These IDs are totally ordered (they have a temporal component), and results are returned from the database in descending order. This guarantees that (b) above is atomic and isolated. If TTLs are utilized, then garbage-collecting legacy versions is basically free, though that comes with the caveat that it relies upon the timing of imports (which is brittle). Use of batch operations that bundle a range DELETE with the INSERT, combined with TTLs as a fallback to address (c), seems reasonably elegant (given the requirements we're working with), even if it's not compatible with HiveToCassandra.
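
For concreteness, a minimal sketch of that batch idea using the Python cassandra-driver (the keyspace, table, and column names are assumptions, not the real schema):

```
# Minimal sketch of "range DELETE bundled with the INSERT, plus TTL as a
# fallback", using the Python cassandra-driver. Keyspace, table and column
# names are assumptions, not the real schema.
from cassandra.cluster import Cluster
from cassandra.query import BatchStatement

TTL_SECONDS = 30 * 24 * 3600  # fallback garbage collection for legacy versions

session = Cluster(["localhost"]).connect("image_suggestions")  # placeholder contact point

delete_older = session.prepare(
    # Range delete on the clustering column: drop every dataset version older
    # than the one being written, within the same partition.
    "DELETE FROM suggestions WHERE wiki = ? AND page_id = ? AND dataset_id < ?"
)
insert_new = session.prepare(
    "INSERT INTO suggestions (wiki, page_id, dataset_id, suggestion) "
    f"VALUES (?, ?, ?, ?) USING TTL {TTL_SECONDS}"
)

def replace_suggestions(wiki, page_id, dataset_id, suggestions):
    """Single-partition batch: drop older versions and write the new one together."""
    batch = BatchStatement()
    batch.add(delete_older, (wiki, page_id, dataset_id))
    for suggestion in suggestions:
        batch.add(insert_new, (wiki, page_id, dataset_id, suggestion))
    session.execute(batch)
```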