
[Discuss] Storage of model training/testing datasets
Closed, Resolved · Public

Description

We should have a good rule of thumb for any input dataset that an ORES model uses.

This task was originally created to discuss the randomization in sampling 2k articles per mid-level category using PAWS vs. manually sampling them after fetching a certain number of articles per WikiProject using queries like:
https://en.wikipedia.org/w/api.php?action=query&generator=embeddedin&geititle=Template:WikiProject%20Accessibility&geinamespace=1&prop=info

Essentially, we decided to store the output of @Sumit's script in the repo to preserve its history.

Should we do that for all samples? What is a good set of guidelines?

Event Timeline

Thanks for making the task! I'm going to start with the exhaustive, MW API-based approach that I saw mentioned in the IRC backscroll. Even in the crude form it's in now, I can pull 23k article titles for WikiProject Medicine in ten seconds or so, and do the sampling in memory. Another benefit of this approach is that we can get /somewhat/ stable samples by setting the random seed (note to self: and by ordering the results by page ID).
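A minimal sketch of what that exhaustive fetch-then-sample approach could look like. The `generator=embeddedin` parameters mirror the query in the description; the helper names, the use of the `requests` library, and the seed/sample-size values are illustrative assumptions, not the actual script:

```python
import random
import requests

API_URL = "https://en.wikipedia.org/w/api.php"


def fetch_tagged_pages(template, session=None):
    """Collect every Talk page embedding a WikiProject banner template,
    following API continuation until the result set is exhausted."""
    session = session or requests.Session()
    params = {
        "action": "query",
        "format": "json",
        "generator": "embeddedin",
        "geititle": template,
        "geinamespace": 1,   # Talk namespace, as in the query above
        "geilimit": "max",
        "prop": "info",
    }
    pages = []
    while True:
        response = session.get(API_URL, params=params).json()
        pages.extend(response.get("query", {}).get("pages", {}).values())
        if "continue" not in response:
            break
        params.update(response["continue"])
    return pages


def stable_sample(pages, k, seed=0):
    """Order by page ID before sampling so a fixed seed yields the same sample."""
    ordered = sorted(pages, key=lambda page: page["pageid"])
    random.seed(seed)
    return random.sample(ordered, min(k, len(ordered)))


sample = stable_sample(fetch_tagged_pages("Template:WikiProject Medicine"), k=2000)
```

With the full set of tagged pages pulled up front, the sampling happens in memory, and the fixed seed plus the page-ID ordering is what makes the sample /somewhat/ reproducible.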

Vvjjkkii renamed this task from [Discuss] Random sampling by PAWS vs API requests to aodaaaaaaa.Jul 1 2018, 1:12 AM
Vvjjkkii triaged this task as High priority.
Vvjjkkii updated the task description. (Show Details)
Vvjjkkii removed a subscriber: Aklapper.
Sumit renamed this task from aodaaaaaaa to [Discuss] Random sampling by PAWS vs API requests.Jul 1 2018, 10:23 AM
Sumit lowered the priority of this task from High to Low.
Sumit updated the task description. (Show Details)
CommunityTechBot raised the priority of this task from Low to Needs Triage.Jul 5 2018, 6:36 PM

It seems to me that we want to have a long-term snapshot of the input data. This is helpful when we discover some kind of statistical anomaly: it'll be nice to check the *exact sample* that the data originates from. So, if we use the API, it'll be important that we keep a snapshot of the data in our repo.

Halfak renamed this task from [Discuss] Random sampling by PAWS vs API requests to [Discuss] Storage of model training/testing datasets.Jan 24 2019, 3:46 PM
Halfak added a project: ORES.
Halfak updated the task description. (Show Details)

I propose a simple guideline. For cases where a specific version of a dataset is stored long term with a unique identifier, it is OK to not store that dataset in our repositories. Instead, we'll include the unique identifier in our Makefiles/configuration.

For cases where a specific version of a dataset is not stored long term, or cannot be exactly re-created using a unique identifier, we should commit the simplest version of the dataset to our repositories for re-use, but we should also maintain the provenance of the dataset in our Makefiles/configuration.

Halfak claimed this task.

We decided to store the file on a public repository (figshare) and reference that directly from the makefile. See https://ndownloader.figshare.com/files/9828517
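A minimal sketch of that pattern, assuming a Python fetch step invoked from the makefile. Only the figshare URL comes from this task; the local path, the checksum parameter, and the use of `requests` are illustrative assumptions:

```python
import hashlib
import pathlib
import requests

# Long-term, unique identifier for the dataset (the figshare file referenced above).
DATASET_URL = "https://ndownloader.figshare.com/files/9828517"
# Hypothetical local target; the real makefile targets may differ.
TARGET = pathlib.Path("datasets/figshare_9828517.json")


def fetch_dataset(expected_sha1=None):
    """Download the pinned dataset if it's missing; optionally verify a recorded checksum."""
    if not TARGET.exists():
        TARGET.parent.mkdir(parents=True, exist_ok=True)
        TARGET.write_bytes(requests.get(DATASET_URL).content)
    if expected_sha1 is not None:
        digest = hashlib.sha1(TARGET.read_bytes()).hexdigest()
        if digest != expected_sha1:
            raise ValueError(f"Checksum mismatch for {TARGET}: {digest}")
    return TARGET
```

Because the figshare file ID is stable, the repository only needs to record the identifier (and, ideally, a checksum) rather than the dataset itself, which matches the guideline proposed above.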