I'd like to include the extracted features in Hadoop. This is a lot of additional data, so I wanted to raise it with #analytics as an open question: the features add roughly 0.4–3KB per score, which would roughly double the amount of data we need to store.
The raw features will be valuable for several reasons, including:
* Potentially useful to researchers and tool authors.
* Could dramatically reduce our time to retrain and adjust existing models.
* Would create exciting possibilities for training or scoring directly in Hadoop.
A detail worth mentioning: `feature_name` might have hundreds of values for each model, and these features are "upstream" of the `model_version`, with their own window of validity. One complication is that a feature's directory would need to be deleted in the rare event that we change how the feature's value is calculated without changing its name.
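To make the deletion scenario concrete, here is a minimal sketch of one possible partition layout, assuming features are partitioned by `feature_name` plus the start of their validity window. All names here (the base path, `feature_partition_path`, the example feature) are illustrative assumptions, not the actual schema:

```python
def feature_partition_path(base: str, feature_name: str, valid_from: str) -> str:
    """Build a Hive-style partition path keyed by feature name and the
    start of its validity window.

    Because each feature gets its own directory subtree, redefining how a
    feature is computed (without renaming it) could be handled by deleting
    just that feature's directory and backfilling, leaving other features
    and the model scores untouched.
    """
    return f"{base}/feature_name={feature_name}/valid_from={valid_from}"

# Hypothetical example: a single feature's partition for one validity window.
path = feature_partition_path("/data/features", "num_refs", "2020-01-01")
print(path)
```

Under this kind of layout, the "change a feature's definition" case reduces to removing one `feature_name=...` subtree, rather than rewriting score data that embeds the features inline.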