==== **User Story**
> As a platform engineer, I need to design a database schema that allows storage of data output by the Image Suggestion process
==== Success Criteria
- [ ] Schema stores all fields from output
- [ ] Supports retrieval of data set records by project & page ID
- [ ] Optionally(?) supports lookup/retrieval by project & page title
- [ ] Storage accommodates bulk import of new records, expiry / deletion of stale data
==== Out of scope
- Storage of data for training
----
==== Cassandra Schema ====
IMPORTANT: Work in progress!
{P17599}{P21420}
**NOTES:**
- The `items` table would be where recommendations are stored. As shown above, it establishes a one-to-many relationship between a page and an arbitrary number of recommendation IDs (a type 1 UUID that uniquely identifies each //import//, while doubling as a timestamp). Additionally, it establishes a one-to-many relationship between each recommendation ID and the recommendations that are a part of that import. This gives us write semantics that allow replacing all of an article's recommendations at once (atomically).
- When importing data into `items`, we could either:
# Write with a TTL set to expire at some point after the next import is expected to happen. Reads would select all recommendations for a project & page ID, and late-filter past results (whether that late-filter would be encapsulated within a service (transparent), or performed by the clients, is TBD). Or...
# Remove during import, using a `DELETE` of items with an `id` less than the one being inserted. The `DELETE` and `INSERT` can be wrapped in a `BATCH` statement, making them atomic.
- The `feedback` table simply keeps a record of user feedback (acceptance & rejection), along with a copy of whichever attributes are needed. Assuming this table exists entirely to mask entries in `items`, the duplicated attributes should be limited to those necessary to accomplish that.
- Joins between `feedback` & `items` are not possible, so queries would need to be made against both tables (concurrently), and a set difference performed to elide records in `items` as needed (whether this would be encapsulated within a service (made transparent), or performed by the clients, is TBD).
- As mentioned elsewhere, this schema models the one-to-many relationship between an article and the recommended images using a map; overwriting the `images` attribute will replace all previous recommendations with the new set.
- Page titles are problematic because the relationship between them and page IDs is modeled in MediaWiki (MediaWiki is canonical), yet the need to //query by title// complicates matters. So, the `title_cache` table (so-named to (hopefully) set the appropriate expectation) serves as an inverted index, mapping titles to page IDs. This requires clients needing by-title access to first query the index/table (but is this an improvement over using the Action API?).
- This elides a separate attribute for timestamp in lieu of using a type 1 UUID for `dataset_id`. This does not prevent us from returning a separate timestamp in queries (ala: `SELECT dataset_id, cast(dataset_id as timestamp) as insertion_ts, ... FROM ...`).
- `instance_of`, like `title`, duplicates relationships modeled by another system.
- The experimental service implements a multi-get style interface, which would amount to a `page_id IN (id, id, ...)` from storage, but what is proposed here only provides discrete access. Multi-get could be provided, but as Cassandra is distributed, it would paper over the fact that it is still many requests on the backend.
- The examples above variously include `page_rev` attributes that might prove useful in qualifying the datasets. YMMV.
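To make the notes above concrete, here is a minimal sketch of the `items` table and of import option 2 (the atomic `DELETE` + `INSERT`). The actual definitions live in the embedded pastes; the keyspace (`image_suggestions`), column names, and literal values below are all assumptions for illustration only:

```sql
-- Sketch only; see the embedded pastes for the real definitions.
-- dataset_id is a type 1 (time-based) UUID: it identifies the import,
-- doubles as a timestamp, and orders imports chronologically.
CREATE TABLE image_suggestions.items (
    wiki       text,       -- project
    page_id    int,
    dataset_id timeuuid,   -- import ID
    image      text,       -- one row per recommended image
    confidence float,
    PRIMARY KEY ((wiki, page_id), dataset_id, image)
) WITH CLUSTERING ORDER BY (dataset_id DESC, image ASC);

-- Import option 2: delete everything older than the incoming import,
-- and insert the new rows, in one atomic batch (same partition).
BEGIN BATCH
    DELETE FROM image_suggestions.items
     WHERE wiki = 'enwiki' AND page_id = 1234
       AND dataset_id < c7f2b3e0-4a2d-11ec-8000-000000000000;
    INSERT INTO image_suggestions.items (wiki, page_id, dataset_id, image, confidence)
    VALUES ('enwiki', 1234, c7f2b3e0-4a2d-11ec-8000-000000000000, 'Example.jpg', 0.85);
APPLY BATCH;
```

Because both statements target the same partition (`wiki`, `page_id`), the batch is applied atomically; readers see either the old import or the new one, never a mix.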
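Likewise, by-title access via `title_cache` becomes a two-step read (again, names are assumptions, not the actual schema):

```sql
-- Inverted index: title → page ID. MediaWiki remains canonical for this
-- relationship; this table is only a cache, hence the name.
CREATE TABLE image_suggestions.title_cache (
    wiki    text,
    title   text,
    page_id int,
    PRIMARY KEY ((wiki, title))
);

-- Step 1: resolve the title to a page ID...
SELECT page_id
  FROM image_suggestions.title_cache
 WHERE wiki = 'enwiki' AND title = 'Example';

-- Step 2: ...then retrieve recommendations by project & page ID as usual.
SELECT *
  FROM image_suggestions.items
 WHERE wiki = 'enwiki' AND page_id = 1234;
```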
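And since joins are unavailable, masking `items` with `feedback` means two independent reads (issued concurrently) with the set difference computed by the caller; the timestamp can be derived from the type 1 UUID in the same query. A hedged sketch, with column names assumed:

```sql
-- Read both tables for the same project & page; subtract rejected images
-- client-side (or in a fronting service — TBD, per the notes above).
SELECT image, dataset_id,
       cast(dataset_id as timestamp) AS insertion_ts  -- no separate ts column needed
  FROM image_suggestions.items
 WHERE wiki = 'enwiki' AND page_id = 1234;

SELECT image, is_accepted
  FROM image_suggestions.feedback
 WHERE wiki = 'enwiki' AND page_id = 1234;
```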