==== **User Story**
> ==== As a platform engineer, I need to design a database schema that allows storage of data output by the Image Suggestion process
==== Success Criteria
- [ ] Schema stores all fields from output
- [ ] Supports retrieval of data set records by project & page ID
- [ ] Optionally(?) supports lookup/retrieval by project & page title
- [ ] Storage accommodates bulk import of new records, expiry / deletion of stale data
==== Out of scope
- Storage of data for training
----
==== Cassandra Storage ====
{P21420}
===== Proposal =====
- From a Product perspective, the `suggestions` table is The Dataset (i.e could appear in a catalog of published datasets for reuse). It establishes a one-to-many relationship between a page (identified by the `(wiki,page_id)` tuple), and an arbitrary number of suggestion IDs ([[ https://en.wikipedia.org/wiki/Universally_unique_identifier#Version_1_(date-time_and_MAC_address) | a type-1 UUID ]]). As this dataset is the product of a batch analytics job, and generated periodically, the suggestion ID (`suggetions.id`) corresponds to each batch run. There is a one-to-many relationship between suggestion IDs, the images suggested, and the attributes that correspond to each.
- The `feedback` table keeps a record of user-supplied feedback for image suggestions. It is considered application state for Structured Data & Growth's use-cases, and not a part of the image suggestions dataset.
- The `title_cache` and `instanceof_cache` tables store attribute relationships that are canonically modeled in other systems (MediaWikis), and are only maintained here for convenience.
- Retention of data in the `suggestions` table will be managed by TTLs. The length of the TTL will be a multiple of the update frequency that provides some historical results (and a buffer against late/missing batch jobs), while keeping result sets bounded for performance. Data in the other tables is maintained in perpetuity, and/or overwritten as required
- Since joins between these tables are not possible, multiple queries will be needed in some scenarios. For example, if //relevant// image suggestions are those without feedback, then separate queries of `suggestions` & `feedback` will need to be performed (can be performed concurrently), and set difference performed. Lookups by table name will first require a query against `title_cache` to find the page ID, and then followed up with a query to `suggestions`.