
[Spike] Proof of concept damage detection with hash vectors
Closed, Duplicate · Public


I think that a good first attempt at improving on this model would be to subtract the vector extracted for the "parent" revision's text from the vector extracted for the current revision's text. This would give positive values for hashes that correspond to segments added and a negative value for hashes that correspond to segments removed.
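The subtraction idea can be sketched with scikit-learn's HashingVectorizer. This is a minimal sketch, not the task's actual code; norm=None and alternate_sign=False are my assumptions, chosen so raw counts keep their signs after subtraction:

```python
# Sketch: difference of hash vectors between a revision and its parent.
# HashingVectorizer is stateless, so the same text segment always maps
# to the same columns; subtracting the parent's vector leaves positive
# values for added segments and negative values for removed ones.
from sklearn.feature_extraction.text import HashingVectorizer

hv = HashingVectorizer(n_features=2**20, alternate_sign=False, norm=None)

parent_text = "The quick brown fox"
current_text = "The quick red fox"

# Positive entry for the hash of "red", negative for "brown",
# zero for the tokens shared by both revisions.
diff = hv.transform([current_text]) - hv.transform([parent_text])
```

The result stays sparse, which matters later when these vectors feed a model with 2**20 columns.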

I'd try to get this to work as a proof of concept like hashing_vectorizer.ipynb and then we can talk about engineering and eventually hyperparameter tuning to see how much fitness we can squeeze out of the strategy.

This task is done when an analysis shows that we can train/test a sklearn model for detecting damage using current features *and* hash vector features.

Event Timeline

There were a few discussions over email. I'm pasting them here for future reference, in reverse chronological order.

On Mon, May 9, 2016 at 9:19 AM, Sabyasachi Ruj <> wrote:

Hey Aaron,

After including the other features using hstack([other_features, hv_features]), the new scores are as follows:

Correct Predictions: 3156 out of 3735 (84.9%)
average precision score: 0.357629577841 
roc auc score: 0.913897800403
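The hstack step above can be sketched with scipy. The array names and sizes here are illustrative stand-ins, not the task's actual data:

```python
# Sketch: combining ~72 hand-built features with the 2**20-column
# hash-vector matrix. scipy.sparse.hstack keeps the result sparse;
# densifying 2**20 columns would exhaust memory.
import numpy as np
from scipy.sparse import csr_matrix, hstack

n_rows = 4
other_features = np.random.rand(n_rows, 72)   # hypothetical hand-built features
hv_features = csr_matrix((n_rows, 2**20))     # hypothetical hash-vector features

combined = hstack([csr_matrix(other_features), hv_features], format="csr")
# combined has 72 + 2**20 columns and stays in sparse CSR format
```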

On Thu, May 5, 2016 at 9:50 PM, Aaron Halfaker <> wrote:

Yeah. That's what I was imagining. I'm not sure how the feature sampling strategy will work within the GradientBoosting model. But it will be interesting to see if the GB model ever ends up selecting many of the 72 features when it's sampling from a pool of millions.

On Thu, May 5, 2016 at 10:30 AM, Sabyasachi Ruj <> wrote:

Oops. Only HV features. I assumed we would combine them with the other features if HV gave good results.

Should these 72 other features be appended as columns to HV's 2**20 columns, and the model then built on that?

On May 5, 2016 7:56 PM, "Aaron Halfaker" <> wrote:
Hey Sabya,

When you build this model and get these stats, are you also using all of the other features we had extracted, or just the HashingVectorizer's features?


On Wed, May 4, 2016 at 11:15 PM, Sabyasachi Ruj <> wrote:

Hey Aaron,

I got the scores you wanted. Here they are:

average precision score: 0.0929607869184
roc auc score: 0.715005282874

What gives?
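For reference, the two scores in this thread can be computed with sklearn from the positive-class probabilities. The labels and probabilities below are hypothetical stand-ins:

```python
# Sketch: average precision (PR-AUC) and ROC-AUC from predicted
# probabilities for the positive (damaging) class.
from sklearn.metrics import average_precision_score, roc_auc_score

y_true = [0, 0, 1, 1, 0, 1]                # hypothetical test labels
y_proba = [0.1, 0.4, 0.35, 0.8, 0.2, 0.7]  # hypothetical predict_proba(...)[:, 1]

ap = average_precision_score(y_true, y_proba)
auc = roc_auc_score(y_true, y_proba)
```

Both metrics are threshold-free, which is why they are preferred here over raw accuracy.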

On Wed, May 4, 2016 at 9:39 AM, Sabyasachi Ruj <> wrote:

I'm still going through PR-AUC, ROC-AUC. Will try to get it in 1-2 days.

In the meantime will this confusion matrix help?
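A confusion matrix like the one mentioned can be produced with sklearn; the labels below are hypothetical stand-ins:

```python
# Sketch: confusion matrix from hard predictions, a stopgap before
# the threshold-free PR-AUC / ROC-AUC scores.
from sklearn.metrics import confusion_matrix

y_true = [0, 0, 1, 1, 0, 1]   # hypothetical labels
y_pred = [0, 1, 1, 1, 0, 0]   # hypothetical model.predict(...)

tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
```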


On Mon, May 2, 2016 at 10:02 AM, Sabyasachi Ruj <> wrote:

Hey Aaron,

A few updates:

  • I have pushed the code to revscoring repo, poc_hashing_vector branch [1]
  • I have copied the data to sabya-precached.ores-staging.eqiad.wmflabs, under /srv/revscoring_hv/data. data.db has the tsv file migrated into the observations table; the content table has the downloaded text for each revid and its parent.
  • features.db should be created by running extract_features()
  • I'll create option switches for running the python script in a while.
  • I'm unable to run the score function: when I convert the sparse matrix to dense using features.todense(), it runs out of memory.
  • For the above reason, I have created a "crude iterative scoring" function, score_model_iterative, which calculates the percentage of correct predictions. Using this method, I could see that 83% of predictions are correct. Is that a good number to start with?
  • Next I will look into including skipgrams.



On Sat, Apr 30, 2016 at 8:26 AM, Sabyasachi Ruj <> wrote:

Hey Aaron,

I've built the model with (uni|bi|tri)grams. Now I need to see the score :) Not sure how to handle skipgrams though; I did not see any option parameter for this in HashingVectorizer. Do we need to build a custom analyzer?
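HashingVectorizer has no skipgram option, but it does accept a callable analyzer, which is one way to answer the question above. The skip_bigrams helper here is hypothetical, not part of sklearn or the branch:

```python
# Sketch: custom analyzer feeding skipgrams into HashingVectorizer.
from sklearn.feature_extraction.text import HashingVectorizer

def skip_bigrams(text, k=1):
    """Emit unigrams plus bigrams that skip up to k intervening tokens."""
    tokens = text.lower().split()
    grams = list(tokens)
    for i, left in enumerate(tokens):
        for gap in range(1, k + 2):        # gap=1 is a plain bigram
            if i + gap < len(tokens):
                grams.append(left + " " + tokens[i + gap])
    return grams

hv = HashingVectorizer(n_features=2**20, analyzer=skip_bigrams)
X = hv.transform(["the quick brown fox"])
```

With a callable analyzer, sklearn skips its own tokenization entirely and hashes whatever strings the callable returns.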

Awesome! Thanks for the dump. Here's my email response.

With our currently deployed model, we get ROC-AUC of 0.914 and average precision score of 0.463. So it looks like we're losing out.

I suspect that the high-signal features are drowning in a pool of mostly low-signal hashvector features. This is something that @JustinOrmont warned me about when suggesting we start experimenting with this strategy. I bet that some model tuning would help mitigate this issue.

How are you parameterizing the GradientBoostingClassifier model when constructing it? I think that it would be worthwhile to try large values for n_estimators and small values for learning_rate. I'm not sure what kind of effect this will have so it will certainly be an experiment.

Also worth noting is that I put together a "grammer" that will create any arbitrary ngram or skipgram efficiently in Python. See the gist of

> How are you parameterizing the GradientBoostingClassifier model when constructing it? I think that it would be worthwhile to try large values for n_estimators and small values for learning_rate. I'm not sure what kind of effect this will have so it will certainly be an experiment.

I am using all default params [1], i.e. gbc = GradientBoostingClassifier(). The default value of n_estimators is 100 and learning_rate is 0.1 [2].

I'll try with n_estimators=200 and learning_rate=0.05 and see.
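The change being tried is just a different construction. A minimal sketch (max_features is my addition, hinting at the per-tree feature sampling discussed earlier, and is an assumption):

```python
# Sketch: more trees, smaller learning rate than the defaults
# (n_estimators=100, learning_rate=0.1).
from sklearn.ensemble import GradientBoostingClassifier

gbc = GradientBoostingClassifier(
    n_estimators=200,      # more boosting stages
    learning_rate=0.05,    # smaller step per stage
    max_features="sqrt",   # assumption: sample features per split
)
```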


Here are the new scores with n_estimators=200 and learning_rate=0.05:

Correct Predictions: 3157 out of 3735 (84.52%)
average precision score: 0.355303317372
roc auc score: 0.914230141197

My apologies for the delay.

I'd look at stacking your models. Take the output of the hashing trick model(s) as additional features to a model w/ the strong features included.

The general idea is to help the model pick hand-created features more often instead of the thousands of features from the hashing trick. I haven't seen an example, but it would be nice if you could assign probabilities of picking each feature for the boosted decision tree. As you likely know, random forests & boosted decision trees choose a subset of the features & training samples for each new tree.

I'm not sure if you've tried differing parameters, but I'd try some parameter ranges:
  • Number of trees to train (50 to 200)
  • % of features to use on each tree (~sqrt(N) to 0.1*N)
  • % of training samples for each tree (0.1*N to 0.9*N)
  • Stopping criteria: min # of samples per node (~10), required information gain / impurity before splitting (doesn't seem available), max tree depth (~3 to 30), etc.
  • Learning speed (~0.05 to 0.5)
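The ranges above map onto GradientBoostingClassifier parameters roughly as follows; a randomized search samples combinations rather than walking the full grid. This mapping is my interpretation, not something prescribed in the thread:

```python
# Sketch: the suggested ranges as a randomized hyperparameter search.
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import RandomizedSearchCV

param_distributions = {
    "n_estimators": [50, 100, 200],          # number of trees
    "max_features": ["sqrt", 0.05, 0.1],     # ~sqrt(N) to 0.1*N features/tree
    "subsample": [0.1, 0.5, 0.9],            # fraction of samples per tree
    "min_samples_leaf": [10],                # min samples per node
    "max_depth": [3, 10, 30],                # max tree depth
    "learning_rate": [0.05, 0.1, 0.5],       # learning speed
}

search = RandomizedSearchCV(
    GradientBoostingClassifier(),
    param_distributions,
    n_iter=20,            # sample 20 combinations instead of the full grid
    scoring="roc_auc",
    cv=3,
)
# search.fit(X_train, y_train) would then expose search.best_params_
```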

Are you hitting overfitting territory? I.e., much better performance on the training set than on the test set.

What's your confusion matrix look like?

You can also play in easy NLP feature-engineering territory: skipgrams, stemming & phonetic algorithms (Soundex, Metaphone); more training data will likely win first though.

In summary, I'd try stacking, finding good parameters & a larger training set.

@JustinOrmont Thank you for the detailed input, much appreciated :)

I'll work on your suggestions. It will take a while as I need to read up on them.

@JustinOrmont, @Halfak: I have a few questions:

Stacking: Any idea how to make progress on this?

Pick hand-created features more often: Any idea how to make some progress on this?

Finding good parameters: is there an efficient way to do this? If I change one parameter at a time, it will take a long time to collect all the scores due to the sheer number of combinations. Is there a best practice for finding parameters?

Larger training set: We had around 16K samples, of which around 600 were is_damaging. Should I try an even larger training set? How large? @Halfak: how do we get a larger set?

Stacking: You know how you built a model based just on the vectorizer features first? Use that model's predict_proba(...) true-class probability as a single feature in a secondary classifier that uses the other 70-ish features.
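That stacking recipe can be sketched as below. All arrays are small random stand-ins for the real feature matrices; for a real run the first-level probability should ideally come from out-of-fold predictions to avoid leakage:

```python
# Sketch: stack a hash-vector model's probability into a second model
# that uses the hand-built features.
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.linear_model import LogisticRegression

rng = np.random.RandomState(0)
hv_features = rng.rand(100, 50)    # stand-in for the 2**20 hash columns
hand_features = rng.rand(100, 5)   # stand-in for the ~70 hand-built features
y = rng.randint(0, 2, 100)         # stand-in damage labels

# First-level model on the vectorizer features alone.
hv_model = LogisticRegression().fit(hv_features, y)
hv_proba = hv_model.predict_proba(hv_features)[:, 1]   # P(damaging)

# Second-level model: hand-built features plus the stacked probability.
stacked = np.column_stack([hand_features, hv_proba])
final_model = GradientBoostingClassifier().fit(stacked, y)
```

This collapses the millions of hash columns into a single high-signal column, so the second model can't drown the hand-built features.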

Pick hand-created features more often: I think that stacking is a good way to try to do this; I'm not sure of another way. One approach that might work is to train a model on the vectorized features and then use the feature_importances_ attribute to identify the highest-weighted features.
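Reading feature_importances_ looks like this in sklearn; the data here is a random stand-in in which one column deliberately carries all the signal:

```python
# Sketch: find the vectorized columns a trained model actually uses.
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier

rng = np.random.RandomState(0)
X = rng.rand(200, 30)                 # stand-in vectorized features
y = (X[:, 3] > 0.5).astype(int)       # column 3 carries the signal

model = GradientBoostingClassifier(n_estimators=50).fit(X, y)

# Indices of the highest-weighted features, most important first.
top = np.argsort(model.feature_importances_)[::-1][:5]
```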

Finding good parameters: Check out the revscoring tune utility. If you have revscoring installed, run revscoring tune -h. Here's how we do model tuning for our edit quality models: For more discussion, see and

Larger training set: We can train on reverted edits. I'll look into producing a very large training set for that.

Halfak triaged this task as Low priority. Jul 5 2016, 2:34 PM