Recommendation algorithm: raw output and production data models
Closed, Resolved, Public

Description

The purpose of this task is to document and reason about the production data model for the recommendation algorithm. During development we would like to output the data from the service to static files, to be used in Android. In production, data will be stored in a database and exposed to clients via an API layer.

The current research code stores recommendation data in a set of wiki-specific TSV files. We refer to these TSV files as _raw output_.

A subset of this data will be exposed to clients via an API. We’ll refer to it as _production data_.
The actual response returned to Task Recommendation clients might be enriched by callbacks; the focus of this Phabricator task is on a representation of recommendations that can be stored in a database and exposed to the Task Recommendation service backend.

The _raw output_ TSVs produced by the algorithm contain the following information:

| Field | Description |
| ----- | ----------- |
| item_id | wikidata item id |
| page_id | mediawiki page id |
| page_title | mediawiki page title |
| selected | selected image |
| notes | Metadata (Commons) |
| descriptions | Metadata (Commons) |
| captions | Metadata (Commons) |
| categories | Metadata (Commons) |
| depicts | Metadata (Commons) |
| size | Metadata (Commons) |
| copyright | Metadata (Commons) |
| date | Metadata (Commons) |
| user | Metadata (Commons) - User:<username> |
| image type | Metadata (Commons) |
| confidence | Recommendation confidence score |
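
As an illustration of how the _raw output_ could be read, here is a minimal sketch in Python. The field names come from the table above. The column order, the absence of a header row, the lack of quoting, and the sample row values are all assumptions made for the example, not properties confirmed by the research code.

```python
import csv
import io

# Field names from the raw output table. Their order in the real TSV files
# is an assumption.
RAW_FIELDS = [
    "item_id", "page_id", "page_title", "selected",
    "notes", "descriptions", "captions", "categories", "depicts",
    "size", "copyright", "date", "user", "image type",
    "confidence",
]


def read_raw_output(stream):
    """Yield one dict per TSV row, keyed by the raw output field names.

    Assumes the file has no header row and uses no quoting.
    """
    reader = csv.DictReader(
        stream,
        fieldnames=RAW_FIELDS,
        delimiter="\t",
        quoting=csv.QUOTE_NONE,
    )
    for row in reader:
        yield row


# Hypothetical sample row: identifiers, empty metadata columns, and a score.
sample = "\t".join(["Q1", "10", "Example_page", "Example.jpg"] + [""] * 10 + ["0.9"]) + "\n"
rows = list(read_raw_output(io.StringIO(sample)))
```

All values are kept as strings here, since the task does not specify types for the raw fields.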

This dataset is processed and stored in the analytics cluster.
Samples are (and will be) available in Hadoop for analysis purposes. This data is not intended for programmatic access.

A pipeline will be put in place to derive _production data_ from the _raw output_ and ingest it into a database. From there, it will be exposed to the Task Recommendation API. While we have not settled on a specific database technology yet (and the schema definition might still undergo changes), we will need to expose at least the following fields:

| Field | Description |
| ----- | ----------- |
| wiki | wiki name |
| page_id | page name/identifier |
| page_title | the article title |
| image_id | recommended image name/identifier |
| source | identifier of the image source |
| confidence | recommendation confidence score |
| insertion_time | record insertion time |
| dataset_uuid | unique identifier of the recommendation dataset |
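
The derivation step could look roughly like the Python sketch below, which projects one _raw output_ row onto the _production data_ fields. Several things here are assumptions rather than decisions recorded in this task: that `image_id` is taken from the raw `selected` column, that `wiki` is supplied by the caller (the raw files being wiki-specific), the default `source` value `"commons"`, the ISO 8601 timestamp format, and the names `ProductionRecord` and `to_production`, which are hypothetical.

```python
import uuid
from dataclasses import dataclass, asdict
from datetime import datetime, timezone


@dataclass
class ProductionRecord:
    """One row of production data; all types are assumptions."""
    wiki: str
    page_id: str
    page_title: str
    image_id: str
    source: str
    confidence: str
    insertion_time: str
    dataset_uuid: str


def to_production(raw_row, wiki, dataset_uuid, source="commons", now=None):
    """Project a raw output row (a dict) onto the production fields."""
    now = now or datetime.now(timezone.utc)
    return ProductionRecord(
        wiki=wiki,
        page_id=raw_row["page_id"],
        page_title=raw_row["page_title"],
        image_id=raw_row["selected"],  # assumed mapping: selected -> image_id
        source=source,
        confidence=raw_row["confidence"],
        insertion_time=now.isoformat(),
        dataset_uuid=dataset_uuid,
    )


# Hypothetical usage: one dataset identifier shared by every row of a run.
run_id = str(uuid.uuid4())
raw = {"page_id": "10", "page_title": "Example_page",
       "selected": "Example.jpg", "confidence": "0.9"}
record = to_production(raw, wiki="enwiki", dataset_uuid=run_id)
record_dict = asdict(record)
```

Generating a single `dataset_uuid` per pipeline run, as done here, is one way to let clients tell which dataset a recommendation came from; `asdict` yields a plain dict that could be written either to a database or to static files during development.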

Event Timeline

WDoranWMF subscribed.

Just adding assignment for our tracking.

@JFishback_WMF just wanted to flag that a page_title column has been added to the "production data" schema.

Based on the schemas in the description, privacy risk for this data is LOW. If the data is expanded at some point in the future to capture, e.g., user match recommendations, we should take another look.

To address the specific questions asked in the review request:

  • What risk level can be associated to "raw output"? LOW
  • What risk level can be associated to "production data"? LOW
  • While this project is still in PoC phase, we are asked to share data samples with stakeholders. Are we allowed to share samples from 1 and 2? YES, see next question.
  • What would be the recommended delivery method? I'm not seeing any Confidential or Restricted data here (https://office.wikimedia.org/wiki/Security/Policy/Data_Classification), so transmission restrictions will depend more on file size than on security considerations, at least for this initial data.
JFishback_WMF moved this task from Incoming to Completed on the Privacy Engineering board.
JFishback_WMF subscribed.

Assigning back to @gmodena as author. I think my portion is done here, but please reassign back if I missed anything.

@JFishback_WMF ack. Many thanks for the detailed overview.

@sdkim I'd say this is resolved.