
Consider committing all non-private datasets to the repos
Closed, Declined · Public

Description

The datasets directory gets really out of hand when models are fully built, adding up to 49 GB for articlequality alone. These files aren't needed in production, so for the most part we don't check them in, but certain non-repeatable files (e.g. shuffled datasets) are committed.

Looking at the model-building workflow, we should be sharing the dataset files rather than recalculating them, and we should be updating and resharing them as they change. git-lfs would be a great fit.
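For illustration, a minimal sketch of what tracking dataset files with git-lfs could look like (the path patterns are hypothetical and would differ per repo):

```
# One-time setup in each repo; track writes the patterns to .gitattributes:
git lfs install
git lfs track "datasets/*.json"
git lfs track "datasets/*.tsv"

# The resulting .gitattributes entries look like:
#   datasets/*.json filter=lfs diff=lfs merge=lfs -text
#   datasets/*.tsv filter=lfs diff=lfs merge=lfs -text

git add .gitattributes datasets/
git commit -m "Track dataset files with git-lfs"
```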

The tricky part of this task is that we sometimes want to hydrate the files for only a single target. I'm not aware of any way to make a git submodule optional, and getting a development environment that includes many unrelated repos sounds annoying, at least at first glance.
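That said, if we did go the git-lfs route, selective hydration is at least possible; a sketch, assuming datasets are laid out per target under datasets/ (repo URL and paths are illustrative):

```
# Clone without downloading any LFS objects:
GIT_LFS_SKIP_SMUDGE=1 git clone https://github.com/wikimedia/articlequality.git
cd articlequality

# Hydrate only the files for one target, e.g. enwiki:
git lfs pull --include="datasets/enwiki*"

# Or make the restriction persistent for future fetches/checkouts:
git config lfs.fetchinclude "datasets/enwiki*"
```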

Event Timeline

awight renamed this task from "Consider storing all files in datasets" to "Consider committing all non-private datasets to the repos". Jun 13 2018, 8:16 PM

No, we should not do this. Git LFS isn't a good fit for storing large text files: it performs very poorly in this case because git-lfs objects don't get delta-compressed, whereas plain git does a great job of compressing anything it can delta (everything except binary files). Using git-lfs only makes sense for binary files, not text files.

We can always compress these files, but I agree that it still wouldn't be as good as delta compression. We probably don't want to store cache files, but we could store labeled datasets without too much of a size hit. We've been bitten in the past by having labeled data change on us due to changes in how we call the editquality autolabel utility.
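As a rough illustration of the trade-off, one could compare the size of a gzipped dataset against what git's own packing achieves for the same content; a sketch, with an illustrative file path:

```
# How much plain compression buys us (file path is illustrative):
du -h datasets/enwiki.labeled_revisions.20k_2015.json
gzip -k datasets/enwiki.labeled_revisions.20k_2015.json
du -h datasets/enwiki.labeled_revisions.20k_2015.json.gz

# For comparison, how much git's packing/delta compression achieves once the
# file is committed (pack size covers the whole repo, so this is only a rough proxy):
git gc
git count-objects -vH
```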

If we were to move forward with this, we'd want an estimate of the size of all labeled data files and then have a discussion about how this would impact the repo download size.
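A rough estimate could be pulled together with something like the following (the filename pattern is a guess and would need adjusting per repo; uses GNU find):

```
# Sum the sizes of everything that looks like a labeled dataset:
find datasets -type f -name '*labeled*' -printf '%s\n' \
  | awk '{total += $1} END {printf "%.1f MiB\n", total / 1024 / 1024}'
```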

Declining, as this topic hasn't come up recently and we have good guidelines for checking in datasets (check in anything that can't be mostly reproduced).