We have a problem with how we encode features in revscoring that will make it awkward to use hashing vectorization. Because a hashing vectorizer produces a very large number of features (2**18 == 262,144 is common), we'll have issues both storing the features and replicating them for a model.
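To make the storage problem concrete, here's a minimal sketch of the hashing trick in plain Python (not revscoring or scikit-learn code; `hash_vectorize` is a hypothetical name). The point is that for a short text, only a handful of the 262,144 buckets are nonzero, so any dense encoding is almost entirely zeros:

```python
from collections import Counter

def hash_vectorize(tokens, n_features=2**18):
    """Toy version of the hashing trick: map each token to a bucket
    index with a hash function and count occurrences per bucket.
    Returns a sparse {index: count} dict rather than a dense vector.
    (A real implementation would use a stable hash, not Python's
    salted hash().)"""
    return dict(Counter(hash(token) % n_features for token in tokens))

tokens = "the quick brown fox jumps over the lazy dog".split()
vec = hash_vectorize(tokens)
# Only a few of the 262,144 buckets are nonzero for a short text,
# so a dense per-column encoding would be almost entirely zeros.
print(len(vec), "nonzero buckets out of", 2**18)
```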
- We write "feature" files that contain extracted features for a train/test set in TSV format. Each row is an observation and each column is a feature. Currently, these files have ~70 columns. We'd make them ~3,800x bigger if we naively encoded every hashed feature into its own column! So I think we'll need a specialized feature abstraction. I'd like to somehow write a single column to the file per hash vector. I've looked into string encodings of sparse matrices, but there's no obvious solution, so I'd like to see if we can work out some sort of JSON-based intermediary format.
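One possible shape for that JSON-based intermediary format: serialize the sparse hash vector as a single JSON object so it occupies one TSV column. This is only a sketch with hypothetical helper names (`encode_sparse_column`/`decode_sparse_column`), not a settled format:

```python
import json

def encode_sparse_column(sparse_vec):
    """Encode a sparse hash vector ({index: value}) as one compact
    JSON string so it fits in a single TSV column."""
    # JSON object keys must be strings; sort indices for stable output.
    return json.dumps({str(i): v for i, v in sorted(sparse_vec.items())},
                      separators=(",", ":"))

def decode_sparse_column(cell):
    """Inverse of encode_sparse_column: recover the {index: value} dict."""
    return {int(i): v for i, v in json.loads(cell).items()}

cell = encode_sparse_column({7: 2, 141002: 1})
print(cell)  # {"7":2,"141002":1}
assert decode_sparse_column(cell) == {7: 2, 141002: 1}
```

JSON contains no tab characters by default, so the encoded cell round-trips safely through a TSV row.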
- We store revscoring.Features in the model file that we generate. This is nice because it makes it easy to replicate the exact feature set used to train a model when scoring new revisions. Again, storing a revscoring.Feature for every value in the hash vector would be crazy. So I think we'll want a simple revscoring.FeatureSet abstraction that can represent a specific HashVector or the like. We'll need a pre-processor built into our revscoring.ScorerModels that will allow them to take a revscoring.FeatureSet as a single value and expand it before combining the expanded values with the rest of the features and passing them all to the wrapped classifier model.
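A rough sketch of what that pre-processor could look like. `FeatureSet` and `preprocess` here are hypothetical stand-ins, not the revscoring API; the idea is just that one FeatureSet value expands into many classifier inputs:

```python
class FeatureSet:
    """Hypothetical stand-in for a revscoring.FeatureSet: represents a
    whole hash vector as one named feature instead of 2**18 Features."""
    def __init__(self, name, n_features):
        self.name = name
        self.n_features = n_features

    def expand(self, sparse_value):
        """Expand a sparse {index: value} dict into the dense list of
        values the wrapped classifier expects."""
        dense = [0] * self.n_features
        for i, v in sparse_value.items():
            dense[i] = v
        return dense

def preprocess(features, values):
    """Pre-processor sketch: expand any FeatureSet values and
    concatenate them with the ordinary scalar feature values."""
    out = []
    for feature, value in zip(features, values):
        if isinstance(feature, FeatureSet):
            out.extend(feature.expand(value))
        else:
            out.append(value)
    return out

# One scalar feature plus one 8-bucket hash vector -> 9 classifier inputs.
fs = FeatureSet("word_hashes", n_features=8)
row = preprocess(["n_words", fs], [3, {1: 2, 5: 1}])
print(row)  # [3, 0, 2, 0, 0, 0, 1, 0, 0]
```

Because the FeatureSet (not 262,144 individual Features) is what gets pickled into the model file, the model stays small while the classifier still sees the full expanded vector at scoring time.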
This card is done when we have a PR merged to revscoring that allows for tractable use of hash vectors as feature values.