
[Discuss] Storage of model training/testing datasets
Closed, Resolved · Public

Description

We should have a good rule of thumb for any input dataset that an ORES model uses.

This task was originally created to discuss the randomization in sampling 2k articles per mid-level category using PAWS vs. manually sampling them after fetching a certain number of articles per WikiProject using queries like:
https://en.wikipedia.org/w/api.php?action=query&generator=embeddedin&geititle=Template:WikiProject%20Accessibility&geinamespace=1&prop=info

Essentially, we decided to store the output of @Sumit's script in the repo to preserve its history.

Should we do that for all samples? What is a good set of guidelines?

Event Timeline

Thanks for making the task! I'm going to start with the exhaustive, MW API-based approach that I saw mentioned in the IRC backscroll. Even in the crude form it's in now, I can pull 23k article titles for WikiProject Medicine in ten seconds or so, and do the sampling in memory. Another benefit of this approach is that we can get /somewhat/ stable samples by setting the random seed (note to self: and by ordering the results by page ID).
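A minimal sketch of what that exhaustive fetch-then-sample approach could look like. The `generator=embeddedin` parameters mirror the query in the description; the helper names, the use of the `requests` library, and the seed/sample-size values are illustrative assumptions, not the actual script:

```python
import random
import requests

API_URL = "https://en.wikipedia.org/w/api.php"


def fetch_tagged_pages(template, session=None):
    """Collect every Talk page embedding a WikiProject banner template,
    following API continuation until the result set is exhausted."""
    session = session or requests.Session()
    params = {
        "action": "query",
        "format": "json",
        "generator": "embeddedin",
        "geititle": template,
        "geinamespace": 1,   # Talk namespace, as in the query above
        "geilimit": "max",
        "prop": "info",
    }
    pages = []
    while True:
        response = session.get(API_URL, params=params).json()
        pages.extend(response.get("query", {}).get("pages", {}).values())
        if "continue" not in response:
            break
        params.update(response["continue"])
    return pages


def stable_sample(pages, k, seed=0):
    """Order by page ID before sampling so a fixed seed yields the same sample."""
    ordered = sorted(pages, key=lambda page: page["pageid"])
    random.seed(seed)
    return random.sample(ordered, min(k, len(ordered)))


sample = stable_sample(fetch_tagged_pages("Template:WikiProject Medicine"), k=2000)
```

With the full set of tagged pages pulled up front, the sampling happens in memory, and the fixed seed plus the page-ID ordering is what makes the sample /somewhat/ reproducible.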

Vvjjkkii renamed this task from [Discuss] Random sampling by PAWS vs API requests to aodaaaaaaa.Jul 1 2018, 1:12 AM
Vvjjkkii triaged this task as High priority.
Vvjjkkii updated the task description. (Show Details)
Vvjjkkii removed a subscriber: Aklapper.
Sumit renamed this task from aodaaaaaaa to [Discuss] Random sampling by PAWS vs API requests.Jul 1 2018, 10:23 AM
Sumit lowered the priority of this task from High to Low.
Sumit updated the task description. (Show Details)
CommunityTechBot raised the priority of this task from Low to Needs Triage.Jul 5 2018, 6:36 PM

It seems to me that we want to have a long-term snapshot of the input data. This is helpful when we discover some kind of statistical anomaly: it'll be nice to check the *exact sample* that the data originates from. So, if we use the API, it'll be important that we keep a snapshot of the data in our repo.

Halfak renamed this task from [Discuss] Random sampling by PAWS vs API requests to [Discuss] Storage of model training/testing datasets.Jan 24 2019, 3:46 PM
Halfak added a project: ORES.
Halfak updated the task description. (Show Details)

I propose a simple guideline. For cases where a specific version of a dataset is stored long term with a unique identifier, it is OK to not store that dataset in our repositories. Instead, we'll include the unique identifier in our Makefiles/configuration.

For cases where a specific version of a dataset is not stored long term, or cannot be exactly re-created using a unique identifier, we should commit the simplest version of the dataset to our repositories for re-use, but we should also maintain the provenance of the dataset in our Makefiles/configuration.

Halfak claimed this task.

We decided to store the file on a public repository (figshare) and reference that directly from the makefile. See https://ndownloader.figshare.com/files/9828517
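A minimal sketch of that pattern, assuming a Python fetch step invoked from the makefile. Only the figshare URL comes from this task; the local path, the checksum parameter, and the use of `requests` are illustrative assumptions:

```python
import hashlib
import pathlib
import requests

# Long-term, unique identifier for the dataset (the figshare file referenced above).
DATASET_URL = "https://ndownloader.figshare.com/files/9828517"
# Hypothetical local target; the real makefile targets may differ.
TARGET = pathlib.Path("datasets/figshare_9828517.json")


def fetch_dataset(expected_sha1=None):
    """Download the pinned dataset if it's missing; optionally verify a recorded checksum."""
    if not TARGET.exists():
        TARGET.parent.mkdir(parents=True, exist_ok=True)
        TARGET.write_bytes(requests.get(DATASET_URL).content)
    if expected_sha1 is not None:
        digest = hashlib.sha1(TARGET.read_bytes()).hexdigest()
        if digest != expected_sha1:
            raise ValueError(f"Checksum mismatch for {TARGET}: {digest}")
    return TARGET
```

Because the figshare file ID is stable, the repository only needs to record the identifier (and, ideally, a checksum) rather than the dataset itself, which matches the guideline proposed above.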