I'd like to include the extracted features in Hadoop. This is a lot of additional data so I wanted to raise this with Analytics as an open question. For each score (0.4–3KB), this would roughly double the amount of data we need to store.
The raw features will be valuable for several reasons, including:
- Potentially useful to researchers and tool authors.
- Could dramatically reduce our time to retrain and adjust existing models.
- Would create exciting possibilities for training or scoring directly in Hadoop.
A detail worth mentioning is that feature_name might have hundreds of values for each model. These features are "upstream" of the model_version so have their own time window for validity. One complication is that feature directories should properly be deleted in the rare event that we need to change how a feature's value is calculated, but not change the feature's name.