==== **User Story**
> ==== As a platform engineer, I need to design a database schema that allows storage of data output by the Image Recs process
==== Success Criteria
[ ] Schema stores all fields from the output
[ ] Supports retrieval of data set records by page id
[ ] Supports indexing of //matching records//¹ by a sequence (necessary for retrieval of pseudorandom results)
[ ] Schema allows overwrite of existing records with new data, and TTL expiry of stale data

¹ //Articles with a non-zero number of recommended images.//

Out of scope (for now):
- Any additional fields that may be required for interacting with the data
- Version history

----

==== Cassandra Schema ====

IMPORTANT: Work in progress!

{P17599}

**NOTES:**
-# Since we are after write semantics that allow us to replace all of an article's recommendations at once (atomically and in isolation), this schema models the one-to-many relationship between an article and its recommended images using a map; overwriting the `images` attribute replaces all previous recommendations with the new set.
-# This elides a separate timestamp attribute in favor of using a type 1 UUID for `dataset_id`. This does not prevent us from returning a separate timestamp in queries (à la `SELECT dataset_id, cast(dataset_id as timestamp) as insertion_ts, ... FROM ...`).
-# The proposed indexing by sequence number will be racy unless ingestion is carefully orchestrated (which would be a disappointing precedent to set).
# The service implements a multi-get style interface, which would amount to a `page_id IN (id, id, ...)` query against storage, but what is proposed here only provides discrete access. Multi-get could be provided, but because Cassandra is distributed, that would paper over the fact that it is still many requests on the backend.
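
The note above about using a type 1 UUID for `dataset_id` depends on the timestamp being recoverable from the UUID itself. As a rough client-side illustration (not part of the proposed schema), the sketch below does the equivalent of the `cast(dataset_id as timestamp)` expression in Python. The helper name `insertion_ts` is borrowed from the query alias and is otherwise hypothetical; it assumes the value is a standard version 1 (time-based) UUID.

```python
import uuid
from datetime import datetime, timedelta, timezone

# Version 1 UUIDs count 100-nanosecond intervals since the start of the
# Gregorian calendar (1582-10-15 00:00:00 UTC), not since the Unix epoch.
GREGORIAN_EPOCH = datetime(1582, 10, 15, tzinfo=timezone.utc)


def insertion_ts(dataset_id: uuid.UUID) -> datetime:
    """Return the UTC timestamp embedded in a version 1 UUID.

    Hypothetical helper: the name mirrors the alias used in the example
    query and is not taken from any existing service code.
    """
    if dataset_id.version != 1:
        raise ValueError("expected a version 1 (time-based) UUID")
    # UUID.time is the 60-bit timestamp in 100 ns units; integer-divide
    # by 10 to get whole microseconds, the finest unit timedelta supports.
    return GREGORIAN_EPOCH + timedelta(microseconds=dataset_id.time // 10)


if __name__ == "__main__":
    dataset_id = uuid.uuid1()  # stands in for a freshly written dataset_id
    print(dataset_id, insertion_ts(dataset_id).isoformat())
```

Because the timestamp travels inside the identifier, readers of the table can derive an insertion time without the schema carrying a second column that could drift out of sync with `dataset_id`.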