
Consider committing all non-private datasets to the repos
Closed, Declined · Public

Description

The datasets directory gets really out of hand when models are fully built, adding up to 49 GB for articlequality alone. These files aren't needed in production, so for the most part we don't check them in, but certain non-repeatable files (e.g. shuffled datasets) are committed.

Looking at the model-building workflow, we should be sharing the dataset files rather than recalculating them, and we should be updating and resharing them as they change. git-lfs would be a great fit.
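For illustration, a minimal sketch of what tracking dataset files with git-lfs could look like (the path patterns are hypothetical and would differ per repo):

```
# One-time setup in each repo; track writes the patterns to .gitattributes:
git lfs install
git lfs track "datasets/*.json"
git lfs track "datasets/*.tsv"

# The resulting .gitattributes entries look like:
#   datasets/*.json filter=lfs diff=lfs merge=lfs -text
#   datasets/*.tsv filter=lfs diff=lfs merge=lfs -text

git add .gitattributes datasets/
git commit -m "Track dataset files with git-lfs"
```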

The tricky part of this task is that we sometimes want to hydrate the files for only a single target. I'm not aware of any way to make a git submodule optional, and getting a development environment that includes many unrelated repos sounds annoying, at least at first glance.
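That said, if we did go the git-lfs route, selective hydration is at least possible; a sketch, assuming datasets are laid out per target under datasets/ (repo URL and paths are illustrative):

```
# Clone without downloading any LFS objects:
GIT_LFS_SKIP_SMUDGE=1 git clone https://github.com/wikimedia/articlequality.git
cd articlequality

# Hydrate only the files for one target, e.g. enwiki:
git lfs pull --include="datasets/enwiki*"

# Or make the restriction persistent for future fetches/checkouts:
git config lfs.fetchinclude "datasets/enwiki*"
```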

Event Timeline

awight renamed this task from "Consider storing all files in datasets" to "Consider committing all non-private datasets to the repos". Jun 13 2018, 8:16 PM

No, we should not do this. Git LFS isn't a good fit for storing large text files: it performs very poorly in this case because git-lfs objects don't get delta-compressed, whereas plain git does a great job of compressing anything it can delta (everything except binary files). Using git-lfs only makes sense for binary files, not text files.

We can always compress these files, but I agree that it still wouldn't be as good as delta compression. We probably don't want to store cache files, but we could store labeled datasets without too much of a size hit. We've been bitten in the past by having labeled data change on us due to changes in how we call the editquality autolabel utility.
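As a rough illustration of the trade-off, one could compare the size of a gzipped dataset against what git's own packing achieves for the same content; a sketch, with an illustrative file path:

```
# How much plain compression buys us (file path is illustrative):
du -h datasets/enwiki.labeled_revisions.20k_2015.json
gzip -k datasets/enwiki.labeled_revisions.20k_2015.json
du -h datasets/enwiki.labeled_revisions.20k_2015.json.gz

# For comparison, how much git's packing/delta compression achieves once the
# file is committed (pack size covers the whole repo, so this is only a rough proxy):
git gc
git count-objects -vH
```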

If we were to move forward with this, we'd want an estimate of the size of all labeled data files and then have a discussion about how this would impact the repo download size.
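A rough estimate could be pulled together with something like the following (the filename pattern is a guess and would need adjusting per repo; uses GNU find):

```
# Sum the sizes of everything that looks like a labeled dataset:
find datasets -type f -name '*labeled*' -printf '%s\n' \
  | awk '{total += $1} END {printf "%.1f MiB\n", total / 1024 / 1024}'
```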

Declining, as this topic hasn't come up recently and we have good guidelines for checking in datasets (check in anything that can't be mostly reproduced).