
Implement abstraction for Sparse Feature Vectors
Closed, ResolvedPublic


We have a problem in how we encode features in revscoring that will make it a little bit weird to use hashing vectorization. Given that a hashing vectorizer produces a very large number of features (2**18 == 262,144 is common), we'll have issues storing the features and replicating them for a model.

  1. We write "feature" files that contain extracted features for a train/test set in a TSV format. Each row is an observation and each column is a feature. Currently, these files have ~70 columns. We'd make them 3800x bigger if we naively encoded every hashed feature into a column! So, I think we'll need a specialized feature abstraction. I'd like to somehow write a single column to the file per hash vector. I've looked into string encodings of sparse matrices[1], but it looks like there's no obvious solution. I'd like to see if we can work out a JSON-based intermediary format of some sort.
  2. We store revscoring.Features in the model file that we generate. This is nice because it makes it easier to replicate the exact feature set used to train a model when scoring new revisions. Again, storing a revscoring.Feature for every value in the hash vector would be crazy. So, I think we'll want a simple revscoring.FeatureSet abstraction that can represent a specific HashVector or the like. We'll need a pre-processor built into our revscoring.ScorerModels that will allow them to take a revscoring.FeatureSet as a single value and expand it before combining the real values with the rest of the features and passing them all to the wrapped classifier model.
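To make item 1 concrete, here's a minimal sketch (with hypothetical helper names, not an existing revscoring API) of collapsing a sparse hash vector into a single JSON-encoded TSV column:

```python
import json

def encode_sparse(pairs):
    # Collapse (index, value) pairs into one JSON string for one TSV column.
    return json.dumps([[index, value] for index, value in pairs])

def decode_sparse(column):
    # Recover the (index, value) pairs from the JSON column.
    return [(index, value) for index, value in json.loads(column)]

# A 2**18-dimensional hash vector with three non-zero entries collapses
# into one short string instead of 262,144 columns:
pairs = [(523, 1.0), (77012, 2.0), (201998, 1.0)]
column = encode_sparse(pairs)
assert decode_sparse(column) == pairs
assert "\t" not in column  # safe to embed in a TSV cell
```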

This card is done when we have a PR merged to revscoring that allows for tractable use of hash vectors as feature values.

Event Timeline

Halfak triaged this task as Low priority.Jul 5 2016, 2:35 PM

I started looking at this. See some work here:

I think that we want a special "featurevector" type of feature. I've been exploring how we'd do Hashing and Tfidf to help me think about some of the harder bits.

It looks like we'll want a train/fit pattern for this.

It seems like we can do some pipelining like sklearn does. It should work well for our dependency style setup.
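A rough sketch of the train/fit pattern and pipelining I have in mind (all names here are hypothetical, not the final revscoring API):

```python
class HashVectorizer:
    """Stateless step: maps tokens into a fixed-width sparse count vector."""

    def __init__(self, n_features=2 ** 18):
        self.n_features = n_features

    def transform(self, tokens):
        vector = {}
        for token in tokens:
            index = hash(token) % self.n_features
            vector[index] = vector.get(index, 0) + 1
        return vector


class FrequencySelector:
    """Trainable step: keeps only indexes that appear in enough documents."""

    def __init__(self, min_df=2):
        self.min_df = min_df
        self.keep = set()

    def fit(self, vectors):
        document_counts = {}
        for vector in vectors:
            for index in vector:
                document_counts[index] = document_counts.get(index, 0) + 1
        self.keep = {i for i, c in document_counts.items()
                     if c >= self.min_df}
        return self

    def transform(self, vector):
        return {i: v for i, v in vector.items() if i in self.keep}
```

The fit step is what needs a new training utility; transform-only steps (like hashing) can slot straight into the existing dependency-style setup.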

OK. More progress. I sketched out what it might look like to use the abstractions that I have in mind.


I think this isn't too crazy. But I'll essentially be ignoring some of the performance optimizations that sklearn provides in order to implement it.

It would be a little bit weird to implement the training process for the tfidf selector.

My notes from IRC:

[16:58:59] <halfak> OK.  Cleaned up the vector stuff more.  So right now, I have (1) a viable strategy for gramming, hashing, tfidf selection, and serialization and (2) working tests for all that stuff.  
[16:59:41] <halfak> I think I'll be working to bridge the gap between a feature vector with sub-vectors and the way that ScorerModels behave.  
[17:00:23] <halfak> I think we'll want a utility for training selectors like tfidf and estimator-based strategies
[17:00:38] <halfak> I suppose I'll work on a demo before I start worrying about what the utility is going to look like. 
[17:00:46] <halfak> I'm calling it a day.
[17:00:46] <halfak> o/

More cleanup and I've implemented the vector pattern in the tests for scoring models. I think that all that really remains for this work is setting up a demo. Once I can demo the work from T128087, I think it'll be time to submit this for review.

I did some work yesterday in PAWS but I kept running into memory issues that made extracting the hash delta tables impossible. See

After thinking for a while I started work on some new utilities for revscoring -- namely the extract utility -- which is a generalization of extract_features that will work for any datasource. I ended up coming up with a scheme based on JSON file formats that I think will work well for these use-cases.

An observation is a line of JSON that can contain the following fields:

  • rev_id
  • cache
  • <label>

Where <label> can be an arbitrarily named value for use in predictions (e.g. "damaging" or "goodfaith").

"cache" contains a flat mapping between Dependent names and values. A cache is populated using the extract utility. The extract utility allows a user to specify a set of values that should be added to the "cache" for future use. For example, one can specify that editquality.feature_lists.enwiki.damaging and, specifically, revscoring.features.wikitext.revision.datasources.words be included as well.
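For example (field names and values made up for illustration), a single observation line would look like:

```python
import json

# A made-up observation as the extract utility might emit it.  "cache"
# maps Dependent names to the values that were extracted for them.
observation = {
    "rev_id": 123456,
    "damaging": False,
    "cache": {
        "feature.wikitext.revision.chars": 12040,
        "datasource.wikitext.revision.words": ["some", "words"],
    },
}

line = json.dumps(observation)  # one line of the dataset file
assert json.loads(line) == observation
assert "\n" not in line
```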

OK. So now we will be able to use these cache-rich datasets to extract both feature values and the values that are useful for training a FeatureVector selector (like TFiDF).

I'm just now realizing that there are some datasources that we really want to use that will not work with this JSON encoding strategy. Specifically, term frequency tables that use hashing to convert terms to integers. Note how the keys are converted into strings in the example below. Arg. Back to the drawing board.

$ python
Python 3.5.1+ (default, Mar 30 2016, 22:46:26) 
[GCC 5.3.1 20160330] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import json
>>> table = {1: 10, 2: -1}
>>> json.loads(json.dumps(table))
{'2': -1, '1': 10}
>>> table == json.loads(json.dumps(table))
False

OK. So thinking about the serialization problem. I think that it might be easy to just use pickle or something like it. But if we do, then we'll lose the human-readable element. I'm really not liking this idea, but I don't think that we have a good alternative. The best that I can find is repr/eval, but that's not safe and it's an uncommon pattern.


Broken, but human readable

>>> json.dumps({1: 10, 2: -1})
'{"1": 10, "2": -1}'
>>> json.loads(json.dumps({1: 10, 2: -1})) == {1: 10, 2: -1}
False

Pickle, dill, msgpack

Works, but unreadable. Msgpack is the smallest.

>>> pickle.loads(pickle.dumps({1: 10, 2: -1})) == {1: 10, 2: -1}
True
>>> dill.loads(dill.dumps({1: 10, 2: -1})) == {1: 10, 2: -1}
True
>>> msgpack.loads(msgpack.dumps({1: 10, 2: -1})) == {1: 10, 2: -1}
True


Works, and human readable, but dangerous

>>> repr({1: 10, 2: -1})
'{1: 10, 2: -1}'
>>> eval(repr({1: 10, 2: -1})) == {1: 10, 2: -1}
True

dict-->list hack

Works, but requires previous knowledge of what lists should be converted to dicts

>>> json.dumps(list({1: 10, 2: -1}.items()))
'[[1, 10], [2, -1]]'
>>> dict(json.loads(json.dumps(list({1: 10, 2: -1}.items())))) == {1: 10, 2: -1}
True

OK... So one alternative is to have datasources return lists of pairs. This is weird, but it might be performant in the end since we spend a lot of time iterating through complete dictionaries anyway.

>>> import time
>>> def read(iterable):
...     for val in iterable:
...         pass                
>>> list_of_pairs = [(i, i*i) for i in range(10000)]
>>> start = time.time();_ = read(dict(list_of_pairs) for i in range(10000));duration = time.time() - start

>>> print("Rolling a 10k pair list into a dict takes", duration/10000, "seconds.")
Rolling a 10k pair list into a dict takes 0.0005035170555114746 seconds.
>>> dict_of_pairs = dict(list_of_pairs)
>>> start = time.time();_ = read(read(dict_of_pairs.items()) for i in range(10000));duration = time.time() - start
>>> print("Unrolling a 10k item dict takes", duration/10000, "seconds.")
Unrolling a 10k item dict takes 0.00030477705001831056 seconds.

It looks like rolling and unrolling a dict is *very* fast. It would be nice if we could only do it on serialization, but maybe we could do it all the time.
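If we only roll/unroll at the serialization boundary, it might look like this sketch (the dict_valued argument captures the "previous knowledge" problem noted earlier):

```python
import json

def dumps_cache(cache):
    # Unroll dict values into [key, value] pair lists only when writing.
    return json.dumps({
        name: sorted(value.items()) if isinstance(value, dict) else value
        for name, value in cache.items()
    })

def loads_cache(line, dict_valued):
    # Re-roll pair lists into dicts, but only for names known to hold dicts.
    cache = json.loads(line)
    for name in dict_valued:
        cache[name] = {key: value for key, value in cache[name]}
    return cache

cache = {"term_frequency": {1: 10, 2: -1}, "chars": 120}
assert loads_cache(dumps_cache(cache), {"term_frequency"}) == cache
```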

No good options. Let me describe the two I am seriously considering.

1. Just use pickle, dill, msgpack, etc.

This will make lines of observations look like this:

{"rev_id": 3456783, "damaging": true, "goodfaith": false, "cache": "gS}*7bYXLEb#h~6E^=jdX>)0BZZ36mWpXZbWq5S2GBGnQF*G$UF*Y<VF)%iWMOn+b8mHWV`VOKWp-(EX>V>Wb#rBME@@-+5kOfWnpx6a%C=TWo{@uWMOn+b8mHWV`VOMZ);_4X?kU3C}d%DVRLVFa${vKa%FaDb7^mGE_7vhbSXY5%>je'"}

Encoding will look like:

ob['cache'] = str(base64.b85encode(pickle.dumps(ob['cache'])), 'ascii')

Decoding will look like:

ob['cache'] = pickle.loads(base64.b85decode(bytes(ob['cache'], 'ascii')))
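Putting the two snippets together, the full round trip (a self-contained sketch) preserves the non-string keys:

```python
import base64
import pickle

table = {1: 10, 2: -1}  # int keys, which plain JSON would stringify
ob = {"rev_id": 123456, "cache": table}

# Encode: pickle the cache, then base85 it into a printable ASCII string
ob["cache"] = str(base64.b85encode(pickle.dumps(ob["cache"])), "ascii")

# Decode: reverse the steps; the int keys come back intact
ob["cache"] = pickle.loads(base64.b85decode(bytes(ob["cache"], "ascii")))
assert ob["cache"] == table
```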

2. No dicts with non-string keys can be returned by datasources

This is just crazy pants. Some datasources are already returning non-JSONable data types. I'm just going to ignore it from now on.

So there we have it. I guess I'm going to implement some crazy serialization.

AIUI, both pickle and dill allow for code execution if unserializing untrusted input...I'm not sure if that's a concern, but just noting it since eval/repr was marked as dangerous.

Running tests producing hash tables. It looks like we're pretty slow.

With grams:

my_grams = [(0,), (0,1),
            (0,1,2), (0,2),
            (0,1,2,3), (0,2,3), (0,1,3), (0,3)]
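Those tuples are index-offset templates. A simplified sketch (not the actual revscoring gramming code) of how a template set expands a word sequence:

```python
def gram(words, templates):
    """Yield grams by applying each index-offset template at each position.
    The template (0, 2), for instance, pairs a word with the word two
    positions after it (a skip-gram)."""
    for i in range(len(words)):
        for template in templates:
            if i + max(template) < len(words):
                yield tuple(words[i + offset] for offset in template)

templates = [(0,), (0, 1), (0, 2)]
list(gram(["a", "b", "c"], templates))
# [('a',), ('a', 'b'), ('a', 'c'), ('b',), ('b', 'c'), ('c',)]
```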

Using sklearn's hash transformer:

$ python 
Sending requests with default User-Agent.  Set 'user_agent' on mwapi.Session to quiet this message.
Tokenization into 7880 words: 0.1085 seconds
Gramming into 63023 grams: 0.1583 seconds
Hashed grams: 1.2605 seconds
Hash-grams: 1.507 seconds

Using murmurhash3's 32bit hash():

$ python 
Sending requests with default User-Agent.  Set 'user_agent' on mwapi.Session to quiet this message.
Tokenization into 7880 words: 0.1062 seconds
Gramming into 63023 grams: 0.1876 seconds
Hashed grams: 0.4709 seconds
Hash-grams: 0.7386 seconds

OK. I did some more optimizations and I reduced the gram set as follows:

Then I added some tests to compare our performance against the raw HashingVectorizer class from sklearn. See the results below:

Sending requests with default User-Agent.  Set 'user_agent' on mwapi.Session to quiet this message.
Text --> 7880 words: 0.1063 seconds
Words --> 31514 grams: 0.0844 seconds
Grams --> 31514 hashes: 0.2403 seconds
Words --> 31514 hash-grams: 0.3006 seconds
Words --> 31514 hashed grams: 0.3188 seconds
Words --> Hash table 0.3246 seconds
Words --> HashingVectorizer transform 0.2941 seconds

So, we come in at 0.32 seconds and HashingVectorizer can manage 0.29 seconds. This is hardly a difference. I'm declaring victory on performance.

OK. We're good to go. I have everything wrapped up in one giant commit right now. I'll be coming back to this tomorrow to split it up into a few logical commits. For now, see

I'll ping with a new pull request with commits broken up later.