==== User Story ====
> As a platform engineer, I need to design a database schema that allows storage of the data output by the Image Suggestion process
==== Success Criteria ====
- [ ] Schema stores all fields from output
- [ ] Supports retrieval of data set records by project & page ID
- [ ] Optionally(?) supports lookup/retrieval by project & page title
- [ ] Storage accommodates bulk import of new records, and expiry/deletion of stale data
==== Out of scope ====
- Storage of data for training
----
==== Cassandra Schema ====
IMPORTANT: Work in progress!
{P21420}
**NOTES:**
- The `items` table would be where recommendations are stored. As shown above, it establishes a one-to-many relationship between a page and an arbitrary number of recommendation IDs (a type 1 UUID that uniquely identifies each //import//, while doubling as a timestamp). Additionally, it establishes a one-to-many relationship between each recommendation ID and the recommendations that are part of that import.
- When importing data into `items`, we could either:
# Write with a TTL set to expire at some point after the next import is expected to happen. Reads would select all recommendations for a project & page ID, and late-filter results from earlier imports (whether that late-filter would be encapsulated within a service (made transparent), or performed by the clients, is TBD). Or...
# Remove during import using a `DELETE` of items with an `id` less than the one being inserted. The `DELETE` and `INSERT` can be wrapped in a `BATCH` statement, making the replacement atomic.
- The `feedback` table simply keeps a record of user feedback (acceptance & rejection), along with a copy of whichever `items` attributes are needed. Assuming this table exists entirely to mask entries in `items`, the duplicated attributes should be limited to those necessary to accomplish that. User feedback is a one-to-many relationship between an image with feedback and the generated feedback; an arbitrary number of accepts and/or rejects can be registered for a single image.
- Joins between `feedback` & `items` are not possible, so queries would need to be made against both tables (concurrently), and set difference performed to elide records in `items` as needed (whether this would be encapsulated within a service (made transparent), or performed by the clients, is TBD).
- As mentioned elsewhere, page titles are problematic because the relationship between them and page IDs is modeled in MediaWiki (MediaWiki is canonical). The need to //query by title// complicates matters. So, the `title_cache` table (so-named to (hopefully) set the appropriate expectation) serves as an inverted index, mapping titles to page IDs. This requires clients needing by-title access to first query the index/table (but is this an improvement over using the Action API?).
- `instance_of`, like `title`, duplicates relationships modeled by another system.
- The examples above variously include `page_rev` attributes that might prove useful in qualifying the datasets. YMMV.
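To make the notes above concrete, a minimal sketch of the `items` and `feedback` tables might look like the following. The canonical work-in-progress schema is in the embedded paste; every column name here (`wiki`, `page_id`, `image`, `confidence`, etc.) is an illustrative assumption, not a commitment.

```sql
-- Hypothetical sketch only; see the embedded paste for the actual schema.
CREATE TABLE items (
    wiki       text,      -- project (e.g. 'enwiki')
    page_id    int,
    id         timeuuid,  -- type 1 UUID: identifies the import, doubles as a timestamp
    image      text,      -- suggested image
    confidence float,
    PRIMARY KEY ((wiki, page_id), id, image)
);

-- One row per accept/reject event; duplicates only the attributes needed
-- to mask the corresponding entries in items.
CREATE TABLE feedback (
    wiki     text,
    page_id  int,
    image    text,
    ts       timeuuid,   -- when the feedback was registered
    accepted boolean,    -- true = accept, false = reject
    PRIMARY KEY ((wiki, page_id), image, ts)
);
```

With a layout like this, a read is `SELECT * FROM items WHERE wiki = ? AND page_id = ?`, returning every stored import for that page; the reader (service or client) keeps only rows carrying the newest `id`, and concurrently queries `feedback` to subtract masked images via set difference.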
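The two import strategies from the numbered list could be sketched roughly as follows, again assuming the hypothetical column names above:

```sql
-- Option 1: TTL writes. Expire some time after the next import is expected;
-- 60 days is an arbitrary placeholder value.
INSERT INTO items (wiki, page_id, id, image, confidence)
VALUES (?, ?, now(), ?, ?)
USING TTL 5184000;

-- Option 2: delete-on-import. Both statements target the same partition
-- (wiki, page_id), so the batch applies atomically.
BEGIN BATCH
  DELETE FROM items WHERE wiki = ? AND page_id = ? AND id < ?;  -- drop older imports
  INSERT INTO items (wiki, page_id, id, image, confidence) VALUES (?, ?, ?, ?, ?);
APPLY BATCH;
```

The trade-off: Option 1 keeps imports cheap but makes every read pay for the late filter (and retains superseded rows until expiry); Option 2 keeps reads clean at the cost of range tombstones accumulating in the partition.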
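And the `title_cache` inverted index, under the same assumptions:

```sql
-- Maps (project, title) to a page ID. MediaWiki remains canonical, so rows
-- here can go stale and should be treated strictly as a cache.
CREATE TABLE title_cache (
    wiki    text,
    title   text,
    page_id int,
    PRIMARY KEY ((wiki, title))
);

-- By-title access is then a two-step read:
SELECT page_id FROM title_cache WHERE wiki = ? AND title = ?;
-- ...followed by the usual items query using the returned page_id.
```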