
Decide whether we will include raw features
Open, Lowest, Public


I'd like to include the extracted features in Hadoop. This is a lot of additional data, so I wanted to raise this with Analytics as an open question. At 0.4–3kB per score, this would roughly double the amount of data we need to store.

Potential path:


The raw features will be valuable for several reasons, including:

  • Potentially useful to researchers and tool authors.
  • Could dramatically reduce our time to retrain and adjust existing models.
  • Would create exciting possibilities for training or scoring directly in Hadoop.

A detail worth mentioning is that feature_name might have hundreds of values for each model. These features are "upstream" of the model_version, so they have their own time window for validity. One complication is that feature directories would need to be deleted in the rare event that we change how a feature's value is calculated without changing the feature's name.
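To make the directory discussion concrete, here is a minimal sketch of what a Hive-style partition layout for per-feature directories might look like. The base path and every name below are purely illustrative assumptions, not the path proposed in this task:

```python
# A hypothetical Hive-style partition layout for per-feature directories.
# The base path and all names below are illustrative, not a settled scheme.
def feature_path(wiki: str, model: str, feature_name: str, snapshot: str) -> str:
    """Build one partition directory path for an extracted feature."""
    return (
        "/wmf/data/features"  # illustrative base path
        f"/wiki={wiki}/model={model}"
        f"/feature_name={feature_name}/snapshot={snapshot}"
    )

print(feature_path("enwiki", "editquality", "words_added", "2018-12"))
```

Under a layout like this, changing how a feature is computed without renaming it would mean deleting and regenerating every directory under that feature_name= prefix.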

Event Timeline

awight created this task.Dec 3 2018, 11:41 PM
awight added subscribers: Halfak, Ladsgroup.
awight updated the task description. Dec 3 2018, 11:43 PM
awight added a comment.Dec 7 2018, 3:57 PM

Offsite chat suggests that there's value in actually storing the raw "root" data sources that we build the feature tree from. Let's estimate the storage requirements for doing that.

fdans added a subscriber: fdans. Dec 10 2018, 4:55 PM

Sounds ok to us! Just to be sure, can you provide an estimate of the total increase in size? How often will that get updated and archived? Keep in mind that to update data you have to overwrite the whole directory.

fdans moved this task from Incoming to Radar on the Analytics board.Dec 10 2018, 4:57 PM
fdans added a subscriber: Milimetric.

I think the best way to produce an estimate would be to generate a dataset of root datasources based on a random sample of revisions. Note that root data for the average edit likely grows over time, since the content of revisions in a wiki tends to grow over time. So the best strategy would be to gather a purely random sample across all time and then extrapolate from that.

We can even use the random samples we use to train our models. E.g. for English Wikipedia's editquality models, we can use the 20k sample used in training and testing.
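The extrapolation step above can be sketched in a few lines. This is a sanity-check sketch only: the sample here is fake random data standing in for measured per-revision sizes, and the 815M enwiki revision count is taken from the estimate later in this thread:

```python
import random

def estimate_total_bytes(sample_sizes, total_revisions):
    """Extrapolate total storage from a uniform random sample of
    per-revision root-datasource sizes (in bytes)."""
    mean = sum(sample_sizes) / len(sample_sizes)
    return mean * total_revisions

# Illustrative: a fake 20k sample standing in for measured sizes.
sample = [random.randint(2_000, 20_000) for _ in range(20_000)]
estimate = estimate_total_bytes(sample, 815_000_000)
print(f"~{estimate / 1e12:.1f} TB")
```

With a genuinely uniform sample across all time, the sample mean is an unbiased estimator of the per-revision mean, so this scales directly to a total.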

One other note: there is quite a lot of overlap in root datasources across all models relevant to a single revision. So it would be beneficial if the schema allowed storing one set of root datasources per rev_id, alongside multiple scores, one per model.
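As an illustration of that record shape, here is a sketch of one row keyed by rev_id, with the root datasources stored once and scores nested per model. All field names and values are hypothetical, not a settled schema:

```python
# Illustrative record shape: root datasources stored once per rev_id,
# with scores keyed by model name. Field names are not a settled schema.
row = {
    "rev_id": 123456789,
    "root_datasources": {      # shared by every model scoring this revision
        "revision.text": "...",
        "parent.text": "...",
    },
    "scores": {                # one entry per model
        "editquality": {"damaging": 0.03, "goodfaith": 0.97},
        "articlequality": {"prediction": "B"},
    },
}
```

Storing the shared datasources once per revision avoids duplicating the same inputs under each model's score.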

Just for fun, I elaborated on the quick estimate based on existing w_cache files. Note that these are not the "root" data sources; these are the final, calculated values ready to be fed into a model. Storing the calculated values is one alternative we might consider: the values are compact and can be used by existing models, but they lack flexibility and completeness, so features added in the future will probably require another full MW API extraction.

I did a crude wc on cached features for each model, and got the following average bytes per record. Note that this is for a pickled format of the features, and includes all of the non-feature fields for each row—which might be a pretty realistic equivalent of a row in the proposed Hive table.

editquality: 8kB / row
articlequality: 3kB / row
draftquality: 0.5kB / row
drafttopic: 4kB / row

The features overlap to some degree between models, but I'll ignore that and sum the sizes (excluding page-based drafttopic) to get a conservative overestimate of about 12kB / revision. For enwiki, 815M revisions x 12kB ≈ 9.8TB.

This increase in data sounds fine, and the proposed example path looks fine too. Hundreds of subfolders are only annoying when it comes to repairing Hive tables to make them aware of new partitions, but if you're accessing from Spark it won't matter.

awight removed a subscriber: awight.Thu, Mar 21, 4:05 PM
Harej triaged this task as Lowest priority.Tue, Mar 26, 9:19 PM