==== **User Story**
> As a platform engineer, I need to design a database schema that allows storage of data output by the Image Suggestion process
==== Success Criteria
- [ ] Schema stores all fields from output
- [ ] Supports retrieval of data set records by project & page ID
- [ ] Optionally(?) supports lookup/retrieval by project & page title
- [ ] Storage accommodates bulk import of new records, expiry / deletion of stale data
==== Out of scope
- Storage of data for training
----
==== Cassandra Schema ====
IMPORTANT: Work in progress!
{P21420}
===== Proposal =====

- From a Product perspective, the `suggestions` table is The Dataset (i.e. could appear in a catalog of published datasets for reuse).
- It establishes a one-to-many relationship between a page (identified by the `(wiki, page_id)` tuple), and an arbitrary number of suggestion IDs ([[ https://en.wikipedia.org/wiki/Universally_unique_identifier#Version_1_(date-time_and_MAC_address) | a type-1 UUID ]]). As this dataset is the product of a batch analytics job, and generated periodically, the suggestion ID (`suggestions.id`) corresponds to each batch run. There is a one-to-many relationship between suggestion IDs, the images suggested, and the attributes that correspond to each.
- The `feedback` table stores user-supplied feedback for image suggestions. User feedback is considered application state for Structured Data & Growth's use-cases, and not a part of the image suggestions dataset.
- The `title_cache` and `instanceof_cache` tables store attribute relationships that are canonically modeled in other systems (the MediaWikis), and are only maintained here for convenience.
- Retention of data in the `suggestions` table will be managed by TTLs. The length of the TTL will be a multiple of the update frequency, chosen to provide some historical results (and a buffer against late/missing batch jobs) while keeping result sets bounded for performance. Data in the other tables is maintained in perpetuity, and/or overwritten as required.
- Since joins between these tables are not possible, multiple queries will be needed in some scenarios. For example, if //relevant// image suggestions are those without feedback, then separate queries of `suggestions` & `feedback` will need to be performed (they can be issued concurrently), and the set difference taken. Lookups by page title will first require a query against `title_cache` to find the page ID, followed by a query against `suggestions`.

**NOTES:**

- The `suggestions` table would be where recommendations are stored. As shown above, it establishes a one-to-many relationship between a page, and an arbitrary number of recommendation IDs ([[ https://en.wikipedia.org/wiki/Universally_unique_identifier#Version_1_(date-time_and_MAC_address) | a type-1 UUID ]] that uniquely identifies each //import//, while doubling as a timestamp). Additionally, it establishes a one-to-many relationship between each recommendation ID, and the recommendations that are a part of that import (keyed/sorted by the image name).
- When importing data into `suggestions`, we could do any of:
# Write with a TTL set to expire at some point after the next import is expected to happen. Reads would select all recommendations for a project & page ID, and late-filter any (unexpired) past results. Whether this late-filtering would be encapsulated within a service (i.e. made transparent), or performed by the clients, is TBD. Or...
# Remove past entries during import using a `DELETE` of items with an `id` less-than the one being inserted. The `DELETE` and `INSERT` can be wrapped in a `BATCH` statement, making them atomic. The caveat, of course, is that this only deletes past entries for items //currently// being updated/inserted. Or...
# Issue a `TRUNCATE` and drop the entire contents of the table (all recommendations will be lost until the import repopulates).
- The `feedback` table simply keeps a record of user feedback (acceptance & rejection), with a copy of the attributes that are needed. Assuming this table exists entirely to mask entries in `suggestions`, the duplicated attributes should be limited to only those necessary to accomplish that. It is a one-to-many relationship between an image with feedback, and the generated feedback; an arbitrary number of accepts and/or rejects can be registered for a single image.
- Joins between `feedback` & `suggestions` are not possible, so queries would need to be made against both tables (concurrently), and a set difference performed to elide records in `suggestions` as needed. Whether this would be encapsulated within a service (made transparent), or performed by the clients, is TBD.
- As mentioned elsewhere, page titles are problematic because the relationship between them and page IDs is modeled in MediaWiki (MediaWiki is canonical). The need to //query by-title// complicates matters. So, the `title_cache` table (so-named to (hopefully) set the appropriate expectation) serves as an inverted index, mapping titles to page IDs. This requires clients needing by-title access to first query the index/table (but is this an improvement over using the Action API?).
- `instance_of`, like `title`, duplicates relationships modeled by another system.
- The examples above variously include `page_rev` attributes that might prove useful in qualifying the datasets. YMMV.
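Because the suggestion/import ID is a type-1 UUID, the timestamp of the batch run that produced a row can be recovered from the ID itself (no separate timestamp column needed). A minimal Python sketch; the function names are illustrative, and in practice the IDs would be minted by the import job:

```python
import uuid
from datetime import datetime, timezone

# Offset between the UUID epoch (1582-10-15) and the Unix epoch (1970-01-01),
# expressed in 100-nanosecond intervals (the unit of the type-1 timestamp).
UUID_UNIX_EPOCH_DELTA = 0x01B21DD213814000

def import_id() -> uuid.UUID:
    """Mint a type-1 UUID identifying one batch import."""
    return uuid.uuid1()

def import_time(u: uuid.UUID) -> datetime:
    """Recover the wall-clock time embedded in a type-1 UUID."""
    unix_100ns = u.time - UUID_UNIX_EPOCH_DELTA
    return datetime.fromtimestamp(unix_100ns / 1e7, tz=timezone.utc)
```

Because type-1 UUIDs embed the timestamp in their high-order bits, IDs from later runs also sort after earlier ones, which is what makes the "`DELETE` items with an `id` less-than the one being inserted" strategy workable.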
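The retention policy (TTL as a multiple of the batch update frequency) can be sketched as follows; the `keep_runs` multiplier is an assumed knob, not a decided value:

```python
from datetime import timedelta

def suggestion_ttl(update_frequency: timedelta, keep_runs: int = 2) -> int:
    """TTL in whole seconds for rows written to `suggestions`.

    Retains `keep_runs` multiples of the update frequency, giving some
    historical results plus a buffer against late/missing batch jobs,
    while keeping per-page result sets bounded.
    """
    return int(update_frequency.total_seconds()) * keep_runs

# e.g. a weekly import retained for two cycles:
#   USING TTL <suggestion_ttl(timedelta(days=7))>
```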
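The client-side set difference described above (suggestions without feedback) can be sketched with plain Python data structures standing in for the Cassandra result sets; the column names here are illustrative, not the actual schema:

```python
# Stand-ins for rows returned by two separate (and concurrently issuable)
# SELECTs against `suggestions` and `feedback`; in production these would
# come from the driver (e.g. two execute_async() calls awaited together).
suggestions = [
    {"wiki": "enwiki", "page_id": 42, "image": "A.jpg"},
    {"wiki": "enwiki", "page_id": 42, "image": "B.jpg"},
]
feedback = [
    {"wiki": "enwiki", "page_id": 42, "image": "B.jpg", "accepted": False},
]

def relevant(suggestions, feedback):
    """Elide suggestions that already have feedback (set difference)."""
    seen = {(f["wiki"], f["page_id"], f["image"]) for f in feedback}
    return [s for s in suggestions
            if (s["wiki"], s["page_id"], s["image"]) not in seen]
```

Whether this filtering lives in a service or in each client is the open question noted above; the logic itself is the same either way.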