==== User Story ====
> As a platform engineer, I need to design a database schema that allows storage of the data output by the Image Suggestion process
==== Success Criteria ====
- [ ] Schema stores all fields from output
- [ ] Supports retrieval of data set records by project & page ID
- [ ] Optionally(?) supports lookup/retrieval by project & page title
- [ ] Storage accommodates bulk import of new records, and expiry/deletion of stale data
==== Out of scope ====
- Storage of data for training
----
==== Cassandra Schema ====
IMPORTANT: Work in progress!
{P21420}
**NOTES:**
- The `items` table would be where recommendations are stored. As shown above, it establishes a one-to-many relationship between a page and an arbitrary number of recommendation IDs (a type 1 UUID that uniquely identifies each //import//, while doubling as a timestamp). Additionally, it establishes a one-to-many relationship between each recommendation ID and the recommendations that are part of that import.
- When importing data into `items`, we could either:
# Write with a TTL set to expire at some point after the next import is expected to happen. Reads would select all recommendations for a project & page ID, and late-filter results from earlier imports (whether that late-filter would be encapsulated within a service (made transparent), or performed by the clients, is TBD). Or...
# Remove during import using a `DELETE` of items with an `id` less than the one being inserted. The `DELETE` and `INSERT` can be wrapped in a `BATCH` statement, making the replacement atomic.
- The `feedback` table simply keeps a record of user feedback (acceptance & rejection), along with a copy of whichever `items` attributes are needed. Assuming this table exists entirely to mask entries in `items`, the duplicated attributes should be limited to those necessary to accomplish that. User feedback is a one-to-many relationship between an image with feedback and the generated feedback; an arbitrary number of accepts and/or rejects can be registered for a single image.
- Joins between `feedback` & `items` are not possible, so queries would need to be made against both tables (concurrently), and set difference performed to elide records in `items` as needed (whether this would be encapsulated within a service (made transparent), or performed by the clients, is TBD).
- As mentioned elsewhere, page titles are problematic because the relationship between them and page IDs is modeled in MediaWiki (MediaWiki is canonical). The need to //query by title// complicates matters. So, the `title_cache` table (so-named to (hopefully) set the appropriate expectation) serves as an inverted index, mapping titles to page IDs. This requires clients needing by-title access to first query the index/table (but is this an improvement over using the Action API?).
- `instance_of`, like `title`, duplicates relationships modeled by another system.
- The examples above variously include `page_rev` attributes that might prove useful in qualifying the datasets. YMMV.
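To make the notes above concrete, a minimal sketch of the `items` and `feedback` tables might look like the following. The canonical work-in-progress schema is in the embedded paste; every column name here (`wiki`, `page_id`, `image`, `confidence`, etc.) is an illustrative assumption, not a commitment.

```sql
-- Hypothetical sketch only; see the embedded paste for the actual schema.
CREATE TABLE items (
    wiki       text,      -- project (e.g. 'enwiki')
    page_id    int,
    id         timeuuid,  -- type 1 UUID: identifies the import, doubles as a timestamp
    image      text,      -- suggested image
    confidence float,
    PRIMARY KEY ((wiki, page_id), id, image)
);

-- One row per accept/reject event; duplicates only the attributes needed
-- to mask the corresponding entries in items.
CREATE TABLE feedback (
    wiki     text,
    page_id  int,
    image    text,
    ts       timeuuid,   -- when the feedback was registered
    accepted boolean,    -- true = accept, false = reject
    PRIMARY KEY ((wiki, page_id), image, ts)
);
```

With a layout like this, a read is `SELECT * FROM items WHERE wiki = ? AND page_id = ?`, returning every stored import for that page; the reader (service or client) keeps only rows carrying the newest `id`, and concurrently queries `feedback` to subtract masked images via set difference.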
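The two import strategies from the numbered list could be sketched roughly as follows, again assuming the hypothetical column names above:

```sql
-- Option 1: TTL writes. Expire some time after the next import is expected;
-- 60 days is an arbitrary placeholder value.
INSERT INTO items (wiki, page_id, id, image, confidence)
VALUES (?, ?, now(), ?, ?)
USING TTL 5184000;

-- Option 2: delete-on-import. Both statements target the same partition
-- (wiki, page_id), so the batch applies atomically.
BEGIN BATCH
  DELETE FROM items WHERE wiki = ? AND page_id = ? AND id < ?;  -- drop older imports
  INSERT INTO items (wiki, page_id, id, image, confidence) VALUES (?, ?, ?, ?, ?);
APPLY BATCH;
```

The trade-off: Option 1 keeps imports cheap but makes every read pay for the late filter (and retains superseded rows until expiry); Option 2 keeps reads clean at the cost of range tombstones accumulating in the partition.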
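And the `title_cache` inverted index, under the same assumptions:

```sql
-- Maps (project, title) to a page ID. MediaWiki remains canonical, so rows
-- here can go stale and should be treated strictly as a cache.
CREATE TABLE title_cache (
    wiki    text,
    title   text,
    page_id int,
    PRIMARY KEY ((wiki, title))
);

-- By-title access is then a two-step read:
SELECT page_id FROM title_cache WHERE wiki = ? AND title = ?;
-- ...followed by the usual items query using the returned page_id.
```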