This directory grows out of hand once the models are fully built: it adds up to 49GB for articlequality alone. These files aren't needed in production, so for the most part we don't check them in, though certain non-reproducible (e.g. shuffled) files are committed.
Looking at the model-building workflow, we should be sharing the dataset files rather than recalculating them, and we should be updating and resharing them as well. git-lfs would be a great fit.
The tricky part of this task is that we sometimes want to hydrate only the files for a single target. I'm not aware of any way to make a git submodule optional, and, at first glance at least, setting up a development environment that includes many unrelated repos sounds annoying.
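If the dataset files were tracked with git-lfs directly in this repo, per-target hydration might not need submodules at all: `git lfs pull` accepts an `--include` path filter, and cloning with `GIT_LFS_SKIP_SMUDGE=1` set leaves the large files as pointers until they are asked for. The sketch below only shows the idea. The `datasets/<target>.*` layout, the `enwiki` target name, and both helper functions are hypothetical and would need to match however the files are actually named.

```python
import subprocess


def lfs_pull_command(target, data_dir="datasets"):
    """Build a `git lfs pull` command limited to one target's files.

    Assumption: a target's files live at <data_dir>/<target>.* ;
    adjust the pattern to the real naming scheme.
    """
    pattern = f"{data_dir}/{target}.*"
    return ["git", "lfs", "pull", f"--include={pattern}"]


def hydrate(target, repo_path=".", data_dir="datasets", dry_run=True):
    """Fetch the LFS objects for a single target.

    With dry_run=True (the default) nothing is executed; the command
    is only returned so it can be inspected first.
    """
    cmd = lfs_pull_command(target, data_dir)
    if not dry_run:
        # Requires git and git-lfs to be installed; raises on failure.
        subprocess.run(cmd, cwd=repo_path, check=True)
    return cmd


if __name__ == "__main__":
    # Show the command that would hydrate a single (hypothetical) target.
    print(" ".join(hydrate("enwiki")))
```

A Makefile target could call something like this before building, so that only the files for the model being rebuilt are downloaded.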