Hi,
With this task I'd like to start a discussion around creating a new database to support the Image Suggestion API service.
Image Matching
The context is a cross-team project to suggest images to illustrate Wikipedia articles.
Platform Engineering developed an API and a data pipeline that coordinates the generation of a heuristic (expert-based) model, and publishes datasets (lookup tables) to analytics.wikimedia.org/published. We are considering Swift for production use. Currently the API is hosted on Toolforge VMs, and loads data locally into an in-memory sqlite database.
The project is currently in a PoC phase, but PET will develop and support it once it matures to production. For production, we would like to ingest these datasets into a database, so that the API will be able to consume recommendations from within WMF's production Kubernetes cluster.
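For context, here is a minimal sketch of the PoC-style setup described above: loading a published lookup table into an in-memory sqlite database at service start-up. The file path, table and column names are illustrative assumptions, not the actual schema (see the gist linked under Example below).

```
# Illustrative sketch only: load a published lookup table (TSV) into an
# in-memory sqlite database. Names are assumptions, not the real schema.
import csv
import sqlite3

def load_snapshot(tsv_path: str) -> sqlite3.Connection:
    """Load a published dataset into an in-memory sqlite database."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE suggestions "
        "(wiki TEXT, page_id INTEGER, image TEXT, confidence REAL)"
    )
    with open(tsv_path, newline="") as f:
        conn.executemany(
            "INSERT INTO suggestions VALUES (?, ?, ?, ?)",
            csv.reader(f, delimiter="\t"),
        )
    conn.commit()
    return conn
```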
Data Model
We expect a read-heavy load, with periodic (monthly) high-throughput writes. The data pipeline will publish a static dataset that represents state at model training time. The database will not be updated during the dataset's lifetime (i.e. in between training schedules).
Keeping track of state changes will be addressed by a dedicated component and caching mechanism.
- Number of records: 15,000,000 (PoC) to TBD (prod).
- Read access pattern: a lookup of (wiki, page_id). The result set can contain k records (initially k = 3).
- Write access pattern: periodic (monthly) high throughput write load. Old records can be discarded, and the dataset refreshed.
- Updates: data can be discarded and replaced once a new dataset is available (monthly)
- QPS (reads): from thousands up to millions of requests per day. The primary clients are the Android app and bots; we expect bots to have bursts of activity.
- Data volumes: 1.5GB (PoC) to the order of tens of GBs (prod, estimated).
- Data model assumptions:
  - Data is not a time series (conceptually it's a single datapoint of a time series).
  - Data does not depend on previous state / no need to track history.
  - The dataset is denormalized and stored in a single schema.
  - We'll only perform lookups. No range or analytical queries, joins, etc.
  - We look up by (wiki, page_id) and should return multiple recommendations (see the sketch below).
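To make the access pattern concrete, here is a small illustrative sketch of the single denormalized table, the (wiki, page_id) index, and the only query we expect to run. It uses sqlite for brevity and assumed column names; the production DDL and storage engine are still TBD.

```
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE suggestions (
    wiki       TEXT    NOT NULL,   -- e.g. 'enwiki'
    page_id    INTEGER NOT NULL,
    image      TEXT    NOT NULL,   -- suggested image title
    confidence REAL
);
-- The only access path we need: point lookups by (wiki, page_id).
CREATE INDEX suggestions_wiki_page ON suggestions (wiki, page_id);
""")

# A few made-up rows for illustration.
conn.executemany(
    "INSERT INTO suggestions VALUES (?, ?, ?, ?)",
    [("enwiki", 123, "Example_a.jpg", 0.8),
     ("enwiki", 123, "Example_b.jpg", 0.7),
     ("enwiki", 123, "Example_c.jpg", 0.5)],
)

k = 3  # number of recommendations per lookup, initially 3
rows = conn.execute(
    "SELECT image, confidence FROM suggestions "
    "WHERE wiki = ? AND page_id = ? LIMIT ?",
    ("enwiki", 123, k),
).fetchall()
print(rows)  # up to k suggestions for the (wiki, page_id) pair
```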
Dataset updates
To meet SLAs, we want to keep serving recommendations while the datasets are refreshed. Currently we are considering a snapshot strategy: the new snapshot is loaded while data is served from the existing one; once loading is finished and vetted, the serving layer swaps to the new snapshot, and the old one can be deleted (possibly after some time). This allows for easy rollback, at the cost of temporarily storing two copies of the data.
We need to validate the implications of this approach on replication strategies applied on the db hosts.
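As an illustration of the snapshot swap, here is a sketch under the assumption that both snapshots live as tables in the same database; the table names and the sqlite backend are assumptions, not a decided mechanism.

```
# Illustrative sketch of the snapshot swap (names and backend are assumptions).
import sqlite3

def swap_snapshot(conn: sqlite3.Connection, min_rows: int = 1) -> None:
    """Swap in the freshly loaded snapshot, keeping the old one for rollback."""
    # Vet the new snapshot before it starts serving traffic.
    (count,) = conn.execute("SELECT COUNT(*) FROM suggestions_next").fetchone()
    if count < min_rows:
        raise ValueError("new snapshot failed vetting; keep serving the old one")
    with conn:  # one transaction: readers see either the old or the new table
        conn.execute("ALTER TABLE suggestions RENAME TO suggestions_old")
        conn.execute("ALTER TABLE suggestions_next RENAME TO suggestions")
    # suggestions_old is kept around for a while to allow rollback, and can be
    # dropped once the new snapshot is trusted:
    #   conn.execute("DROP TABLE suggestions_old")
```

On a MariaDB-style backend the same swap could presumably be expressed as a single atomic RENAME TABLE statement; how that interacts with the replication setup on the db hosts is one of the things we would like to validate.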
Example
A sample sqlite DDL we use for development and testing can be found at https://gist.github.com/gmodena/8ffd0ecc2051703bb0d6391d85c44879
Access from service
The API should access the database from kubernetes*.<eqiad|codfw>.wmnet and kubestage*.<eqiad|codfw>.wmnet.
We will also need access from a host (TBD) outside the Hadoop VLANs that coordinates ingestion.
Backup policy
We probably won't need backups. Data can be re-generated from Hadoop if needed.
Privacy
The data we wish to store has been approved by the Privacy Review Process.
Related phab: https://phabricator.wikimedia.org/T273434#6825118
SLO
Work In Progress.
Desires
So far we have identified the following desirable properties:
- We would like dedicated hosts outside the "misc" pool.
- We would like the ingestion process to be automated.
- We would like to support high-throughput writes without the need for aggressive throttling or restrictive batch sizes.
- We would like no API downtime during writes.
- We would like to (re)index the dataset schemas according to the desired read access pattern (index by wiki, page_id).
- We would like to keep copies (possibly variants) of the datasets.
- (Maybe) we would like to store support tables for the API.
- (Maybe) we would like to consolidate data pipelines, and migrate similarusers to this database (hosts).