The purpose of this task is to document and reason about the production data model of the recommendation algorithm. During development, we would like to output the data from the service to static files for use in Android. In production, data will be stored in a database and exposed to clients via an API layer.
The current research code stores recommendation data in a set of wiki-specific TSV files. We refer to these TSV files as _raw output_.
A subset of this data will be exposed to clients via an API. We’ll refer to it as _production data_.
The actual response returned to Task Recommendation clients might be enriched by callbacks. The focus of this Phabricator task is on a representation of recommendations that can be stored in a database and exposed to the Task Recommendation service backend.
The _raw output_ TSVs produced by the algorithm contain the following information:
| Field | Description |
| ----- | ----------- |
| item_id | wikidata item id |
| page_id | mediawiki page id |
| page_title | mediawiki page title |
| selected | selected image |
| notes | Metadata (Commons) |
| descriptions | Metadata (Commons) |
| captions | Metadata (Commons) |
| categories | Metadata (Commons) |
| depicts | Metadata (Commons) |
| size | Metadata (Commons) |
| copyright | Metadata (Commons) |
| date | Metadata (Commons) |
| user | Metadata (Commons) - User:<username> |
| image type | Metadata (Commons) |
| confidence | Recommendation confidence score |
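As a rough illustration of how the _raw output_ could be read, the sketch below parses a TSV with Python's standard `csv` module. It assumes each file starts with a header row whose column names match the table above exactly (including the space in `image type`). The task does not confirm either point, and the sample row is made up for illustration.

```python
import csv
import io

# Column names copied from the table above. That the TSV carries a header
# row with exactly these names is an assumption, not a confirmed fact.
RAW_FIELDS = [
    "item_id", "page_id", "page_title", "selected", "notes", "descriptions",
    "captions", "categories", "depicts", "size", "copyright", "date", "user",
    "image type", "confidence",
]


def read_raw_output(handle):
    """Yield one dict per TSV row, keyed by column name."""
    reader = csv.DictReader(handle, delimiter="\t")
    missing = set(RAW_FIELDS) - set(reader.fieldnames or [])
    if missing:
        raise ValueError(f"missing columns: {sorted(missing)}")
    yield from reader


# Hypothetical sample row, for illustration only.
sample = (
    "\t".join(RAW_FIELDS) + "\n"
    + "\t".join(["Q1", "42", "Example", "Example.jpg"] + [""] * 10 + ["0.9"])
    + "\n"
)
rows = list(read_raw_output(io.StringIO(sample)))
print(rows[0]["page_title"], rows[0]["selected"])
```

All values come back as strings; any type conversion is left to the caller, since the task does not specify column types.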
This dataset is processed and stored in the analytics cluster.
Samples are (and will be) available in Hadoop for analysis purposes. This data is not intended for programmatic access.
A pipeline will be put in place to derive _production data_ from the _raw output_ and ingest it into a database. From there, it will be exposed to the Task Recommendation API. While we have not settled on a specific database technology yet (and the schema definition might still undergo changes), we will need to expose at least the following fields:
| Field | Description |
| ----- | ----------- |
| wiki | wiki name |
| page_id | page name/identifier |
| page_title | the article title |
| image_id | recommended image name/identifier |
| source | identifier of the image source |
| confidence | recommendation confidence score |
| insertion_time | record insertion time |
| dataset_uuid | unique identifier of the recommendation dataset |
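One way the _production data_ fields could be modelled, and dumped to a static file during development, is sketched below. Several details are assumptions rather than decisions recorded in this task: the field types, the mapping of the raw `selected` column to `image_id`, the ISO 8601 timestamp format, and the JSON Lines output format. The helper names `to_production` and `dump_static` are hypothetical.

```python
import io
import json
import uuid
from dataclasses import asdict, dataclass
from datetime import datetime, timezone


@dataclass(frozen=True)
class ProductionRecord:
    # Field names follow the table above; the types are assumptions.
    wiki: str
    page_id: int
    page_title: str
    image_id: str
    source: str
    confidence: float
    insertion_time: str  # ISO 8601 timestamp in UTC (assumed format)
    dataset_uuid: str


def to_production(raw, wiki, source, dataset_uuid, now=None):
    """Derive one production record from one raw output row (a dict)."""
    now = now or datetime.now(timezone.utc)
    return ProductionRecord(
        wiki=wiki,
        page_id=int(raw["page_id"]),
        page_title=raw["page_title"],
        image_id=raw["selected"],  # assumed mapping: selected -> image_id
        source=source,
        confidence=float(raw["confidence"]),
        insertion_time=now.isoformat(),
        dataset_uuid=dataset_uuid,
    )


def dump_static(records, out):
    """Write records to a file-like object, one JSON object per line."""
    for record in records:
        out.write(json.dumps(asdict(record), ensure_ascii=False) + "\n")


# Hypothetical input values, for illustration only.
raw_row = {"page_id": "42", "page_title": "Example",
           "selected": "Example.jpg", "confidence": "0.9"}
record = to_production(
    raw_row,
    wiki="enwiki",
    source="commons",
    dataset_uuid=str(uuid.uuid4()),
    now=datetime(2021, 1, 1, tzinfo=timezone.utc),
)
buffer = io.StringIO()
dump_static([record], buffer)
print(buffer.getvalue())
```

Writing to a file-like object keeps the sketch independent of where the static files end up; passing an open file handle instead of `io.StringIO` would produce a file on disk.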