==== **User Story**
> ==== As a platform engineer, I need to design a database schema that allows storage of data output by the Image Recs process
==== Success Criteria
[ ] Schema stores all fields from output
[ ] Supports retrieval of data set records by page id
[ ] Supports indexing of //matching records//¹ by a sequence (necessary for retrieval of pseudorandom results)
[ ] Schema allows overwriting existing records with new data, and TTL expiry of stale data
¹ //articles with a non-zero number of recommended images.//
Out of scope (for now):
- Any additional fields that may be required for interacting with the data
- Version history
----
==== Cassandra Schema ====
IMPORTANT: Work in progress!
{P17599}
**NOTES:**
# Since we are after write semantics that will allow us to replace all of an article's recommendations at once (atomically and in isolation), this schema models the one-to-many relationship between an article and its recommended images using a map. Overwriting the `images` attribute will replace all previous recommendations with the new set.
# This elides a separate timestamp attribute in favor of using a type 1 UUID for `dataset_id`. This does not prevent us from returning a separate timestamp in queries (à la `SELECT dataset_id, cast(dataset_id as timestamp) as insertion_ts, ... FROM ...`).
# The proposed indexing by sequence number will be racy unless ingestion is carefully orchestrated (which would be a disappointing precedent to set).
# The service implements a multi-get style interface, which would amount to a `page_id IN (id, id, ...)` query against storage, but what is proposed here only provides discrete access. Multi-get could be provided, but because Cassandra is distributed, doing so would paper over the fact that it is still many requests on the backend.
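
The second note relies on a type 1 (time-based) UUID carrying its own creation time. As a rough illustration of why no separate timestamp column is needed, here is a minimal Python sketch of the same conversion done client-side. It assumes the standard RFC 4122 version 1 layout; the names `timeuuid_to_datetime` and `UUID_EPOCH_OFFSET` are made up for this example and are not part of the schema or the service.

```python
import uuid
from datetime import datetime, timedelta, timezone

# Version 1 UUIDs count 100 ns intervals since 1582-10-15 (the Gregorian
# calendar reform). This constant is the number of such intervals between
# that date and the Unix epoch (1970-01-01).
UUID_EPOCH_OFFSET = 0x01B21DD213814000

UNIX_EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def timeuuid_to_datetime(value: uuid.UUID) -> datetime:
    """Return the UTC creation time embedded in a type 1 UUID."""
    if value.version != 1:
        raise ValueError("not a type 1 (time-based) UUID")
    # value.time is the 60-bit timestamp field, in 100 ns units.
    unix_100ns = value.time - UUID_EPOCH_OFFSET
    # Integer arithmetic avoids float rounding; precision below one
    # microsecond is dropped because datetime cannot represent it.
    return UNIX_EPOCH + timedelta(microseconds=unix_100ns // 10)


if __name__ == "__main__":
    dataset_id = uuid.uuid1()  # stand-in for a stored dataset_id
    print(dataset_id, timeuuid_to_datetime(dataset_id).isoformat())
```

A client could use something like this to derive `insertion_ts` from `dataset_id` alone, although doing the cast in the query, as in the note, keeps that logic on the storage side.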