Recommendation algorithm: raw output and production data models
Closed, Resolved (Public)

Assigned To

Authored By

	gmodena
	Feb 1 2021, 10:55 AM

Description

The purpose of this task is to document and reason about the production data model of the recommendation algorithm. During development we would like to output the data from the service to static files, to be used in Android. In production, data will be stored in a database and exposed to clients via an API layer.

The current research code stores recommendation data in a set of wiki-specific TSV files. We refer to these TSV files as _raw output_.

A subset of this data will be exposed to clients via an API. We’ll refer to it as _production data_.
The actual response returned to Task Recommendation clients might be enriched by callbacks; the focus of this Phabricator task is on a representation of recommendations that can be stored in a database and exposed to the Task Recommendation service backend.

The _raw output_ TSVs produced by the algorithm contain the following information:

Field	Description
item_id	wikidata item id
page_id	mediawiki page id
page_title	mediawiki page title
selected	selected image
notes	Metadata (Commons)
descriptions	Metadata (Commons)
captions	Metadata (Commons)
categories	Metadata (Commons)
depicts	Metadata (Commons)
size	Metadata (Commons)
copyright	Metadata (Commons)
date	Metadata (Commons)
user	Metadata (Commons) - User:<username>
image type	Metadata (Commons)
confidence	Recommendation confidence score

This dataset is processed and stored in the analytics cluster.
Samples are (and will be) available in Hadoop for analysis purposes. This data is not intended for programmatic access.
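
For analysis of a sample, the _raw output_ layout above could be loaded along these lines. This is a minimal sketch, not the research code: it assumes each TSV file carries a header row with the field names exactly as listed in the table (including `image type` with a space) and uses no quoting, neither of which is confirmed in this task. `RAW_FIELDS` and `read_raw_output` are hypothetical names.

```python
import csv

# Field names copied from the raw output table above.
# Assumption: the TSV files start with a header row using these exact names.
RAW_FIELDS = [
    "item_id", "page_id", "page_title", "selected", "notes", "descriptions",
    "captions", "categories", "depicts", "size", "copyright", "date", "user",
    "image type", "confidence",
]


def read_raw_output(handle):
    """Yield one dict per TSV row, keyed by the raw output field names.

    `handle` is any text file object. Values are left as strings because
    the column types are not specified in this task.
    """
    # Assumption: fields are tab-separated and unquoted.
    reader = csv.DictReader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    missing = set(RAW_FIELDS) - set(reader.fieldnames or [])
    if missing:
        raise ValueError(f"missing columns: {sorted(missing)}")
    yield from reader
```

Because the function is a generator, the column check runs when the first row is requested, not when the function is called.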

A pipeline will be put in place to derive _production data_ from the _raw output_ and ingest it into a database. From there, it will be exposed to the Task Recommendation API. While we have not settled on a specific database technology yet (and the schema definition might still undergo changes), we will need to expose at least the following fields:

Field	Description
wiki	wiki name
page_id	page name/identifier
page_title	the article title
image_id	recommended image name/identifier
source	identifier of the image source
confidence	recommendation confidence score
insertion_time	record insertion time
dataset_uuid	unique identifier of the recommendation dataset
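
To make the derivation step concrete, here is a hedged sketch of how one _raw output_ row might be projected onto the fields above. It is an illustration under assumptions, not the pipeline: the mapping of the raw `selected` column to `image_id`, the float conversion of `confidence`, the ISO 8601 timestamp format, and the idea that `wiki`, `source` and `dataset_uuid` are supplied by the caller are all guesses. `ProductionRecord` and `to_production` are hypothetical names.

```python
from dataclasses import dataclass, asdict
from datetime import datetime, timezone


@dataclass(frozen=True)
class ProductionRecord:
    """One row of production data; field names follow the table above."""
    wiki: str
    page_id: str
    page_title: str
    image_id: str
    source: str
    confidence: float
    insertion_time: str
    dataset_uuid: str


def to_production(raw_row, wiki, source, dataset_uuid, now=None):
    """Project a raw output row (a dict) onto the production fields.

    Commons metadata columns (notes, user, ...) are simply not copied.
    """
    now = now or datetime.now(timezone.utc)
    return ProductionRecord(
        wiki=wiki,  # assumption: known from the wiki-specific TSV file
        page_id=raw_row["page_id"],
        page_title=raw_row["page_title"],
        image_id=raw_row["selected"],  # assumption: selected image is the recommendation
        source=source,  # assumption: provided by the caller
        confidence=float(raw_row["confidence"]),  # assumption: numeric score
        insertion_time=now.isoformat(),  # assumption: ISO 8601, UTC
        dataset_uuid=dataset_uuid,  # assumption: one identifier per dataset run
    )
```

`asdict(record)` turns a record into a plain dict, which may be convenient for writing static files during development or for a database client later on.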

During development we would like to output the data from the service to static files, to be used in Android.

Related Objects

Mentioned In: T280042: New database request: image_matching

Event Timeline

gmodena created this task. Feb 1 2021, 10:55 AM

Restricted Application added a subscriber: Aklapper. Feb 1 2021, 10:55 AM

gmodena updated the task description. Feb 1 2021, 11:28 AM

WDoranWMF added a project: Platform Team Workboards (Image Suggestion API). Feb 3 2021, 4:43 PM

WDoranWMF moved this task from Backlog to In review on the Platform Team Workboards (Image Suggestion API) board.

Just adding assignment for our tracking

@JFishback_WMF just wanted to flag that a page_title column has been added to the "production data" schema.

Based on the schemas in the description, privacy risk for this data is LOW. If the data is expanded at some point in the future to capture, e.g., user match recommendations, we should take another look.

To address the specific questions asked in the review request:

1. What risk level can be associated to "raw output"? LOW
2. What risk level can be associated to "production data"? LOW
3. While this project is still in PoC phase, we are asked to share data samples with stakeholders. Are we allowed to share samples from 1 and 2? YES, see next question.
4. What would be the recommended delivery method? I'm not seeing any Confidential or Restricted data here (https://office.wikimedia.org/wiki/Security/Policy/Data_Classification) so transmission restrictions will be more based around file size than security considerations. At least for this initial data.

Assigning back to @gmodena as author. I think my portion is done here, but please reassign back if I missed anything.

@gmodena safe to move this to done and resolve?

@JFishback_WMF ack. Many thanks for the detailed overview.

@sdkim I'd say this is resolved.

Thank you @JFishback_WMF !

sdkim moved this task from In review to Done on the Platform Team Workboards (Image Suggestion API) board. Mar 8 2021, 6:07 PM

gmodena mentioned this in T280042: New database request: image_matching. Apr 13 2021, 3:52 PM
