Change Details

==== **User Story** > ==== As a platform engineer, I need to design a database schema that allows storage of data output by the Image Suggestion process ==== Success Criteria - [ ] Schema stores all fields from output - [ ] Supports retrieval of data set records by project & page ID - [ ] Optionally(?) supports lookup/retrieval by project & page title - [ ] Storage accommodates bulk import of new records, expiry / deletion of stale data ==== Out of scope - Storage of data for training ---- ==== Cassandra Storage ==== {P21420} ===== Proposal ===== - From a Product perspective, the `suggestions` table is The Dataset (i.e could appear in a catalog of published datasets for reuse). It establishes a one-to-many relationship between a page (identified by the `(wiki,page_id)` tuple), and an arbitrary number of suggestion IDs ([[ https://en.wikipedia.org/wiki/Universally_unique_identifier#Version_1_(date-time_and_MAC_address) | a type-1 UUID ]]). As this dataset is the product of a batch analytics job, and generated periodically, the suggestion ID (`suggetions.id`) corresponds to each batch run. There is a one-to-many relationship between suggestion IDs, the images suggested, and the attributes that correspond to each. - The `feedback` table keeps a record of user-supplied feedback for image suggestions. It is considered application state for Structured Data & Growth's use-cases, and not a part of the image suggestions dataset. - The `title_cache` and `instanceof_cache` tables store attribute relationships that are canonically modeled in other systems (MediaWikis), and are only maintained here for convenience. - Retention of data in the `suggestions` table will be managed by TTLs. The length of the TTL will be a multiple of the update frequency that provides some historical results (and a buffer against late/missing batch jobs), while keeping result sets bounded for performance. Data in the other tables is maintained in perpetuity, and/or overwritten as required - Since joins between these tables are not possible, multiple queries will be needed in some scenarios. For example, if //relevant// image suggestions are those without feedback, then separate queries of `suggestions` & `feedback` will need to be performed (can be performed concurrently), and set difference performed. Lookups by table name will first require a query against `title_cache` to find the page ID, and then followed up with a query to `suggestions`.

==== **User Story** > ==== As a platform engineer, I need to design a database schema that allows storage of data output by the Image Suggestion process ==== Success Criteria - [ ] Schema stores all fields from output - [ ] Supports retrieval of data set records by project & page ID - [ ] Optionally(?) supports lookup/retrieval by project & page title - [ ] Storage accommodates bulk import of new records, expiry / deletion of stale data ==== Out of scope - Storage of data for training ---- ==== Cassandra Storage ==== {P21420} ===== Proposal ===== - From a Product perspective, the `suggestions` table is The Dataset (i.e could appear in a catalog of published datasets for reuse). It establishes a one-to-many relationship between a page (identified by the `(wiki,page_id)` tuple), and an arbitrary number of suggestion IDs ([[ https://en.wikipedia.org/wiki/Universally_unique_identifier#Version_1_(date-time_and_MAC_address) | a type-1 UUID ]]). As this dataset is the product of a batch analytics job, and generated periodically, the suggestion ID (`suggetions.id`) corresponds to each batch run. There is a one-to-many relationship between suggestion IDs, the images suggested, and the attributes that correspond to each. - The `feedback` table keeps a record of user-supplied feedback for image suggestions. It is considered application state for Structured Data & Growth's use-cases, and not a part of the image suggestions dataset. - The `title_cache` and `instanceof_cache` tables store attribute relationships that are canonically modeled in other systems (MediaWikis), and are only maintained here for convenience. - Retention of data in the `suggestions` table will be managed by TTLs. The length of the TTL will be a multiple of the update frequency that provides some historical results (and a buffer against late/missing batch jobs), while keeping result sets bounded for performance (clients will receive the full result set, even when they only require the most recent). - Since joins between these tables are not possible, multiple queries will be needed in some scenarios. For example, if //relevant// image suggestions are those without feedback, then separate queries of `suggestions` & `feedback` will need to be performed (can be performed concurrently), and set difference performed. Lookups by table name will first require a query against `title_cache` to find the page ID, and then followed up with a query to `suggestions`.

==== **User Story** > ==== As a platform engineer, I need to design a database schema that allows storage of data output by the Image Suggestion process ==== Success Criteria - [ ] Schema stores all fields from output - [ ] Supports retrieval of data set records by project & page ID - [ ] Optionally(?) supports lookup/retrieval by project & page title - [ ] Storage accommodates bulk import of new records, expiry / deletion of stale data ==== Out of scope - Storage of data for training ---- ==== Cassandra Storage ==== {P21420} ===== Proposal ===== - From a Product perspective, the `suggestions` table is The Dataset (i.e could appear in a catalog of published datasets for reuse). It establishes a one-to-many relationship between a page (identified by the `(wiki,page_id)` tuple), and an arbitrary number of suggestion IDs ([[ https://en.wikipedia.org/wiki/Universally_unique_identifier#Version_1_(date-time_and_MAC_address) | a type-1 UUID ]]). As this dataset is the product of a batch analytics job, and generated periodically, the suggestion ID (`suggetions.id`) corresponds to each batch run. There is a one-to-many relationship between suggestion IDs, the images suggested, and the attributes that correspond to each. - The `feedback` table keeps a record of user-supplied feedback for image suggestions. It is considered application state for Structured Data & Growth's use-cases, and not a part of the image suggestions dataset. - The `title_cache` and `instanceof_cache` tables store attribute relationships that are canonically modeled in other systems (MediaWikis), and are only maintained here for convenience. - Retention of data in the `suggestions` table will be managed by TTLs. The length of the TTL will be a multiple of the update frequency that provides some historical results (and a buffer against late/missing batch jobs), while keeping result sets bounded for performance. Data in the other tables is maintained in perpetuity (clients will receive the full result set, and/or overwritten as requiredeven when they only require the most recent). - Since joins between these tables are not possible, multiple queries will be needed in some scenarios. For example, if //relevant// image suggestions are those without feedback, then separate queries of `suggestions` & `feedback` will need to be performed (can be performed concurrently), and set difference performed. Lookups by table name will first require a query against `title_cache` to find the page ID, and then followed up with a query to `suggestions`.