User Story
As a platform engineer, I need to design a database schema that allows storage of data output by the Image Suggestion process
Success Criteria
- Schema stores all fields from output
- Supports retrieval of data set records by project & page ID
- Optionally(?) supports lookup/retrieval by project & page title
- Storage accommodates bulk import of new records, expiry / deletion of stale data
Out of scope
- Storage of data for training
Cassandra Storage
Proposal
- From a Product perspective, the suggestions table is The Dataset (i.e could appear in a catalog of published datasets for reuse). It establishes a one-to-many relationship between a page (identified by the (wiki,page_id) tuple), and an arbitrary number of suggestion IDs (a type-1 UUID). As this dataset is the product of a batch analytics job, and generated periodically, the suggestion ID (suggetions.id) corresponds to each batch run. There is a one-to-many relationship between suggestion IDs, the images suggested, and the attributes that correspond to each.
- The feedback table keeps a record of user-supplied feedback for image suggestions. It is considered application state for Structured Data & Growth's use-cases, and not a part of the image suggestions dataset.
- The title_cache and instanceof_cache tables store attribute relationships that are canonically modeled in other systems (MediaWikis), and are only maintained here for convenience.
- Retention of data in the suggestions table will be managed by TTLs. The length of the TTL will be a multiple of the update frequency that provides some historical results (and a buffer against late/missing batch jobs), while keeping result sets bounded for performance (clients will receive the full result set, even when they only require the most recent).
- Since joins between these tables are not possible, multiple queries will be needed in some scenarios. For example, if relevant image suggestions are those without feedback, then separate queries of suggestions & feedback will need to be performed (can be performed concurrently), and set difference performed. Lookups by table name will first require a query against title_cache to find the page ID, and then followed up with a query to suggestions.