==== **User Story**
> As a platform engineer, I need to design a database schema that allows storage of data output by the Image Suggestion process
==== Success Criteria
- [ ] Schema stores all fields from output
- [ ] Supports retrieval of data set records by project & page ID
- [ ] Optionally(?) supports lookup/retrieval by project & page title
- [ ] Storage accommodates bulk import of new records, expiry / deletion of stale data
==== Out of scope
- Storage of data for training
----
==== Cassandra Schema ====
IMPORTANT: Work in progress!
{P17599}{P21420}
**NOTES:**
- The `items` table would be where recommendations are stored. As shown above, it establishes a one-to-many relationship between a page and an arbitrary number of recommendation IDs (a type 1 UUID that uniquely identifies each //import//, while doubling as a timestamp). Additionally, it establishes a one-to-many relationship between each recommendation ID and the recommendations that are a part of that import. This gives us write semantics that allow replacing all of an article's recommendations at once (atomically).
- When importing data into `items`, we could either:
# Write with a TTL set to expire at some point after the next import is expected to happen. Reads would select all recommendations for a project & page ID, and late-filter past results (whether that late-filter would be encapsulated within a service (transparent), or performed by the clients, is TBD). Or...
# Remove during import, using a `DELETE` of items with an `id` less than the one being inserted. The `DELETE` and `INSERT` can be wrapped in a `BATCH` statement, making them atomic.
- The `feedback` table simply keeps a record of user feedback (acceptance & rejection), along with a copy of whichever attributes are needed. Assuming this table exists entirely to mask entries in `items`, the duplicated attributes should be limited to those necessary to accomplish that.
- Joins between `feedback` & `items` are not possible, so queries would need to be made against both tables (concurrently), and a set difference performed to elide records in `items` as needed (whether this would be encapsulated within a service (made transparent), or performed by the clients, is TBD).
- As mentioned elsewhere, this schema models the one-to-many relationship between an article and the recommended images using a map; overwriting the `images` attribute will replace all previous recommendations with the new set.
- Page titles are problematic because the relationship between them and page IDs is modeled in MediaWiki (MediaWiki is canonical), yet the need to //query by title// complicates matters. So, the `title_cache` table (so-named to (hopefully) set the appropriate expectation) serves as an inverted index, mapping titles to page IDs. This requires clients needing by-title access to first query the index/table (but is this an improvement over using the Action API?).
- This elides a separate attribute for timestamp in lieu of using a type 1 UUID for `dataset_id`. This does not prevent us from returning a separate timestamp in queries (ala: `SELECT dataset_id, cast(dataset_id as timestamp) as insertion_ts, ... FROM ...`).
- `instance_of`, like `title`, duplicates relationships modeled by another system.
- The experimental service implements a multi-get style interface, which would amount to a `page_id IN (id, id, ...)` from storage, but what is proposed here only provides discrete access. Multi-get could be provided, but as Cassandra is distributed, it would paper over the fact that it is still many requests on the backend.
- The examples above variously include `page_rev` attributes that might prove useful in qualifying the datasets. YMMV.
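To make the notes above concrete, here is a minimal sketch of the `items` table and of import option 2 (the atomic `DELETE` + `INSERT`). The actual definitions live in the embedded pastes; the keyspace (`image_suggestions`), column names, and literal values below are all assumptions for illustration only:

```sql
-- Sketch only; see the embedded pastes for the real definitions.
-- dataset_id is a type 1 (time-based) UUID: it identifies the import,
-- doubles as a timestamp, and orders imports chronologically.
CREATE TABLE image_suggestions.items (
    wiki       text,       -- project
    page_id    int,
    dataset_id timeuuid,   -- import ID
    image      text,       -- one row per recommended image
    confidence float,
    PRIMARY KEY ((wiki, page_id), dataset_id, image)
) WITH CLUSTERING ORDER BY (dataset_id DESC, image ASC);

-- Import option 2: delete everything older than the incoming import,
-- and insert the new rows, in one atomic batch (same partition).
BEGIN BATCH
    DELETE FROM image_suggestions.items
     WHERE wiki = 'enwiki' AND page_id = 1234
       AND dataset_id < c7f2b3e0-4a2d-11ec-8000-000000000000;
    INSERT INTO image_suggestions.items (wiki, page_id, dataset_id, image, confidence)
    VALUES ('enwiki', 1234, c7f2b3e0-4a2d-11ec-8000-000000000000, 'Example.jpg', 0.85);
APPLY BATCH;
```

Because both statements target the same partition (`wiki`, `page_id`), the batch is applied atomically; readers see either the old import or the new one, never a mix.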
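Likewise, by-title access via `title_cache` becomes a two-step read (again, names are assumptions, not the actual schema):

```sql
-- Inverted index: title → page ID. MediaWiki remains canonical for this
-- relationship; this table is only a cache, hence the name.
CREATE TABLE image_suggestions.title_cache (
    wiki    text,
    title   text,
    page_id int,
    PRIMARY KEY ((wiki, title))
);

-- Step 1: resolve the title to a page ID...
SELECT page_id
  FROM image_suggestions.title_cache
 WHERE wiki = 'enwiki' AND title = 'Example';

-- Step 2: ...then retrieve recommendations by project & page ID as usual.
SELECT *
  FROM image_suggestions.items
 WHERE wiki = 'enwiki' AND page_id = 1234;
```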
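And since joins are unavailable, masking `items` with `feedback` means two independent reads (issued concurrently) with the set difference computed by the caller; the timestamp can be derived from the type 1 UUID in the same query. A hedged sketch, with column names assumed:

```sql
-- Read both tables for the same project & page; subtract rejected images
-- client-side (or in a fronting service — TBD, per the notes above).
SELECT image, dataset_id,
       cast(dataset_id as timestamp) AS insertion_ts  -- no separate ts column needed
  FROM image_suggestions.items
 WHERE wiki = 'enwiki' AND page_id = 1234;

SELECT image, is_accepted
  FROM image_suggestions.feedback
 WHERE wiki = 'enwiki' AND page_id = 1234;
```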