
[Spike] Experiment with using bag-of-words badwords features and general NLP strategies.
Closed, ResolvedPublic


[Revitalized by aetilley on 30 October]

Discuss/experiment with the use of each badword as a separate feature (instead of the whole list as a single feature), as in the Machine Learning course on Coursera.

WIP badwords as features
Discussion on meta

See for discussion

Event Timeline

Halfak raised the priority of this task from to Needs Triage.
Halfak updated the task description. (Show Details)
Halfak moved this task to Active on the Machine-Learning-Team (Active Tasks) board.
Halfak added a subscriber: Halfak.
He7d3r set Security to None.
He7d3r awarded a token.

@He7d3r would you be interested in picking up this card soon?

I assigned this task to @He7d3r because he is/was making progress on it. :)

aetilley renamed this task from Experiment with using bag-of-words badwords features to [Spike] Experiment with using bag-of-words badwords features. Oct 30 2015, 5:51 PM
aetilley claimed this task.
aetilley updated the task description. (Show Details)
aetilley moved this task from Paused to Backlog on the Machine-Learning-Team (Active Tasks) board.
aetilley renamed this task from [Spike] Experiment with using bag-of-words badwords features to [Spike] Experiment with using bag-of-words badwords features and general NLP strategies. Nov 20 2015, 6:22 PM
aetilley moved this task from Paused to Backlog on the Machine-Learning-Team (Active Tasks) board.

Using large feature sets requires very large datasets to be effective, and the more subtle the content that you're trying to extract (e.g. "sneaky vandalism") the more difficult it is to extract this content from an editor's word choice.

If we're going to go full NLP, we'll need lots and lots of data.

A proposal:

We DO have a very large dataset (all labeled and unlabeled revisions), but a relatively small set of labeled data (revisions marked as (un)reverted, or wiki-labeled as (non)damaging or good/bad faith).

Solution? Apply semi-supervised learning.

In [1], Nigam et al. apply the EM algorithm to a large dataset of articles, only a small fraction of which are labeled (as belonging to one of 20 newsgroups). They are able to obtain a high-accuracy classifier.

How EM works (and how unlabelled data can be useful) is beyond the scope of this email, but see the paper [1].
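To make the idea concrete, here is a toy sketch of semi-supervised learning with a bag-of-words Naive Bayes classifier. Note it uses hard label assignments in the E-step (sometimes called "classification EM"), which is a simplification of the soft-responsibility EM in [1]; all function names and the toy data are mine.

```python
import math
from collections import Counter

def train_nb(docs, labels, classes, vocab, alpha=1.0):
    """Multinomial Naive Bayes with add-alpha smoothing; returns a predict function."""
    prior = Counter(labels)
    word_counts = {c: Counter() for c in classes}
    for doc, y in zip(docs, labels):
        word_counts[y].update(doc)
    totals = {c: sum(word_counts[c].values()) for c in classes}
    n = len(labels)

    def log_prob(doc, c):
        lp = math.log((prior[c] + alpha) / (n + alpha * len(classes)))
        for w in doc:
            lp += math.log((word_counts[c][w] + alpha) / (totals[c] + alpha * len(vocab)))
        return lp

    def predict(doc):
        return max(classes, key=lambda c: log_prob(doc, c))
    return predict

def em_semi_supervised(labeled, unlabeled, classes, rounds=5):
    """Hard-assignment EM: guess labels for the unlabeled docs, retrain, repeat."""
    vocab = ({w for doc, _ in labeled for w in doc}
             | {w for doc in unlabeled for w in doc})
    docs = [d for d, _ in labeled]
    labels = [y for _, y in labeled]
    predict = train_nb(docs, labels, classes, vocab)      # initialize on labeled data
    for _ in range(rounds):
        guessed = [predict(d) for d in unlabeled]         # E-step (hard labels)
        predict = train_nb(docs + unlabeled, labels + guessed,
                           classes, vocab)                # M-step
    return predict

# Toy data: "stupid" never appears in the labeled set, but EM can still
# associate it with the "bad" class via its co-occurrence with "vandal".
labeled = [(["revert", "vandal"], "bad"), (["fix", "typo"], "good")]
unlabeled = [["vandal", "stupid"], ["fix", "helpful"]]
predict = em_semi_supervised(labeled, unlabeled, ["bad", "good"])
```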

One main difference in our case is that, in [1], the input is a document and the features are word occurrences. We would most likely have diffs/revisions as inputs, with n-grams added/removed (for some small n) as our features.
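Extracting such n-gram features from a diff could be sketched like this (a simple multiset difference between parent and current revision; the function names are mine):

```python
from collections import Counter

def ngrams(tokens, n):
    """Multiset of n-grams in a token sequence."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def ngram_diff(parent_tokens, current_tokens, n=2):
    """N-grams added and removed by an edit (Counter subtraction keeps
    only positive counts, so this behaves as a multiset difference)."""
    before = ngrams(parent_tokens, n)
    after = ngrams(current_tokens, n)
    return after - before, before - after   # (added, removed)

added, removed = ngram_diff("the quick fox".split(), "the stupid fox".split())
```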

Also, in [1] the authors are attempting to infer a specific newsgroup (e.g. *Sports*) from the presence of words (e.g. "baseball"), which, again, on its face seems easier than inferring, say, good/bad faith from the removal/insertion of various n-grams.

On the other hand we have much more data than the authors of [1].

Even if this approach turns out not to work for NLP features, it might give us valuable information with our current feature set.

Either way, I would be willing to implement this.

[1] K. Nigam, A. McCallum, S. Thrun, and T. Mitchell. Text classification from labeled and unlabeled documents using EM. Machine Learning, 39, 103-134, 2000

Looked at two more papers.

The first paper was

Vandalism Detection in Wikipedia: a Bag-of-Words Classifier Approach ( by Amit Belani)

This provided some valuable insight (see notes below), but we won't follow this route at this time, since it is currently too difficult to add large families (in the thousands) of new features to our models.

Belani (2009):

Belani examines words not just in the article but also in the editorial comment (and the IP address of the editor).

Target class is vandalism or not.

Data used: pages-meta-history.xml.bz2

Only considered anonymous edits.

Merges multiple successive edits by the same editor into one edit (but records how many consecutive edits were merged and uses this integer as a feature).

Included html/wiki markup in word list.

For each word in global word set (see below), add a feature for both difference and ratio of number of occurrences.

Neither TFiDF nor word stemming were used.

Ancillary features:

(Bool) Edit Empty?

(Bool) Comment Entered?

(Bool) Marked Minor Edit?

(Bool) Edit was change to External Links?

(Int) First four octets of IPv4 of anon editor.

(Int) N-1 where N is the number of consecutive edits by same editor

(Int) num chars and num words in previous and current revision and their differences.

Three squashing functions were considered for mapping feature values into the interval [0,1]:

1. arctan(x) * 2/pi
2. min[ ln(x+1) / ln(max(X) + 1), 1 ] (where max(X) is the maximum feature value on the training set)
3. binary: 1 if x > 0, else 0

Note that for the binary function, we don't need both difference and ratio features.
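As a quick sketch, the three scalings might be implemented as follows (function names are mine; all assume non-negative feature values):

```python
import math

def squash_binary(x):
    """1 if the feature fired at all, else 0."""
    return 1.0 if x > 0 else 0.0

def squash_atan(x):
    """arctan scaling into [0, 1) for x >= 0."""
    return math.atan(x) * 2 / math.pi

def squash_loglin(x, x_max):
    """Log-linear scaling capped at 1; x_max is the maximum feature
    value observed on the training set."""
    return min(math.log(x + 1) / math.log(x_max + 1), 1.0)
```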

Quoting Belani:

"2,000,000 cases from the years 2001-08 were extracted from articles starting with the letters A to M. 1,585,397 unique words and an average of 104 words per case were extracted. Because additions and subtractions of words were processed as separate features, the number of features in the dataset is at least twice the number of unique words, or 3,170,794. Additionally, because atan and log-lin scaled datasets also include word ratio features, the number of features in them is at least four times the number of unique words, or 6,341,588. "

Uses Logistic Regression (mentions that linear SVM could work).

Interestingly, the best performing squashing function was the binary function.

But to reiterate, any way that we decide to do this, we will have to make sense of *parametrized families* of features in our code, since we can’t define thousands of features individually.
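One way such a parametrized family could look in code (purely illustrative; the feature names and the add-one smoothing in the ratio are my own choices, not Belani's):

```python
def make_word_features(word):
    """Return (name, function) pairs for one word's difference and ratio
    features, generated programmatically rather than defined by hand."""
    def diff(parent_count, current_count):
        return current_count - parent_count
    def ratio(parent_count, current_count):
        return (current_count + 1) / (parent_count + 1)  # add-one avoids division by zero
    return [(word + "_diff", diff), (word + "_ratio", ratio)]

# The whole family is generated from a word list in one pass:
word_set = ["stupid", "awesome", "cite"]
features = [f for w in word_set for f in make_word_features(w)]
```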

The second paper was

Language of Vandalism: Improving Wikipedia Vandalism Detection via Stylometric Analysis
(Manoj Harpalani, Michael Hart, Sandesh Singh, Rob Johnson, and Yejin Choi)

"In this work, we explore the use of PCFG models for vandalism detection, by viewing the task as a genre detection problem, where a group of authors share similar linguistic behavior.”

The authors hypothesize that regular and vandalism edits will exhibit not only lexical differences but “stylometric” differences as well (which includes not just lexical properties, but grammatical/syntactic and stylistic patterns). If this is true, they reason, it should be detectable by examining the behavior of two PCFGs (Probabilistic Context Free Grammars) trained on regular and vandalism Wikipedia edits.

They give Raghavan et al as their primary source for this use of PCFGs to detect stylometric differences:
(Sindhu Raghavan, Adriana Kovashka, and Raymond Mooney. 2010. Authorship attribution using probabilistic context-free grammars. In Proceedings of the ACL, pages 38–42, Uppsala, Sweden, July. Association for Computational Linguistics.)

In short, PCFGs extend the basic formulation of CFGs by stipulating (or inferring from training data) a scoring function q such that, for each nonterminal symbol X, the function Q(Y) := q(X -> Y) is a probability distribution over the space of all strings Y of terminals and nonterminals.

A typical example of such a rule X -> Y:
VERBPHRASE -> swims


Given this q (i.e., given a trained PCFG) we can then assign a probability p(T) to any parse tree, or p(s) to any string in our language. Further, the act of parsing a string s reduces to computing argmax_{T a tree yielding s} p(T).

Once we have trained two PCFGs on regular and vandalism edits, classification reduces, in the simplest case, to taking the maximum probability yielded by the two PCFGs, although more typically we will want to use, say, the difference in the two probabilities as a feature value in a classifier.
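A toy sketch of that classification signal, using the inside algorithm to score a sentence under two CNF PCFGs (the grammars, rule probabilities, and example sentence here are made up for illustration):

```python
import math
from collections import defaultdict

def inside_log_prob(words, lexical, binary_rules, start="S"):
    """Total log-probability of a sentence under a PCFG in Chomsky Normal
    Form, computed with the inside (sum-product CKY) algorithm.
    lexical: {(A, word): prob}; binary_rules: {(A, (B, C)): prob}."""
    n = len(words)
    chart = [[defaultdict(float) for _ in range(n + 1)] for _ in range(n + 1)]
    for i, w in enumerate(words):                       # terminal rules A -> w
        for (A, word), p in lexical.items():
            if word == w:
                chart[i][i + 1][A] += p
    for span in range(2, n + 1):                        # binary rules A -> B C
        for i in range(n - span + 1):
            j = i + span
            for k in range(i + 1, j):
                for (A, (B, C)), p in binary_rules.items():
                    chart[i][j][A] += p * chart[i][k][B] * chart[k][j][C]
    p = chart[0][n][start]
    return math.log(p) if p > 0 else float("-inf")

# Two toy grammars standing in for C_regular and C_vandal:
regular = ({("N", "dogs"): 1.0, ("V", "run"): 1.0},
           {("S", ("N", "V")): 1.0})
vandal = ({("N", "dogs"): 0.5, ("N", "junk"): 0.5, ("V", "run"): 1.0},
          {("S", ("N", "V")): 1.0})

sentence = ["dogs", "run"]
# The difference in log-likelihoods is the kind of value used as a feature:
feature = inside_log_prob(sentence, *vandal) - inside_log_prob(sentence, *regular)
```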

The authors aim to compare models trained on features other than those derived from PCFGs to models that have both types of features. They categorize the features considered into four groups and mention some (but not all) of the features from each category:

  1. Metadata:

If Registered user:
For how long?
How many previous acts of vandalism?
How frequently do they edit?
Characteristics of the particular article's revision History:
How many times previously vandalized?
Last time edited?
How many times reverted?

  2. Lexical Cues:

"Edit distance" (I'm guessing similar to our bytes_changed)
Number of repeated patterns,
Number of slang words, vulgarities, and pronouns? (This last seems new)
Type of edit (insert vs. modification vs. delete)

  3. Sentiment:

Positive and negative sentences inserted
Overall sentiment score for the diff
Number of inserted or deleted subjective sentences

(They don't go into detail here, other than to mention that this is based on the work of Pang and Lee:
Bo Pang and Lillian Lee. 2004. A sentimental education: sentiment analysis using subjectivity summarization based on minimum cuts. In Proceedings of the 42nd Annual Meeting on Association for Computational Linguistics, ACL ’04, Stroudsburg, PA, USA. Association for Computational Linguistics.)

  4. Stylometric:

"For our PCFG, we take the difference between the minimum log-likelihood score (i.e. the sentences with the minimum log-likelihood) of C_vandal and C_regular, the difference in the maximum log-likelihood score, the difference in the mean log-likelihood score, the difference in the standard deviation of the mean log-likelihood score and the difference in the sum of the log-likelihood scores."

(Here, C_vandal and C_regular are the PCFGs trained on vandal and regular edits.)

The Classifier:

"We use Weka’s (Hall et al., 2009) implementation of LogitBoost (Friedman et al., 2000) to perform the classification task. We use Decision Stumps (Iba and Langley, 1992) as the base learner and run LogitBoost for 500 iterations. We also discretize the training data using the Multi-Level Discretization technique (Perner and Trautzsch, 1998)."

My limited research shows that LogitBoost is an extension of AdaBoost, which I more or less understand. But presumably this could all be done with our SVMs/RFs.


They test with the PAN 2010 vandalism corpus. They compare a classifier based only on features in categories 1-3 above to a classifier with the PCFG features as well.

Adding the PCFG features improved AUC by 1.3 points (from 91.6 to 92.9).

Again, the paper does not list all features used, but they do give a list of the top 10 features based on the "InfoGain" metric. The top four pertain to author metadata, but the 5th and 6th ranked features are the differences in the maximum and mean PCFG scores (respectively). Also interesting is that "vulgarisms" and other "shallow lexico-syntactic pattern" features ranked very low on the list (in particular they do not make the top 10).


A major difference of this second approach is that introducing PCFGs adds only a handful of new features to our models.

But that means that every time we do feature extraction on a raw datasource, say, to train our main model or to score some revision, we have to compute, e.g.:

"the difference between the minimum log-likelihood score (i.e. the sentences with the minimum log-likelihood) of Cvandal and Cregular,"

min_{s \in revision}(log(p_{vandal}(s))) - min_{s \in revision}(log(p_{regular}(s)))

Here the minima are taken over each sentence s in the revision; p_{vandal} and p_{regular} are the probabilities assigned by the vandalism and regular grammars (PCFGs), respectively, where these grammars will be trained ahead of time.
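Given per-sentence log-likelihood scores under each grammar, the five stylometric differences quoted above could be computed as in this sketch (I use the population standard deviation of the per-sentence scores; the paper's phrasing, "standard deviation of the mean log-likelihood score," is ambiguous on this point):

```python
import statistics

def stylometric_features(vandal_scores, regular_scores):
    """Differences of summary statistics of per-sentence log-likelihoods
    under C_vandal and C_regular: min, max, mean, stdev, and sum."""
    stats = [("min", min), ("max", max), ("mean", statistics.mean),
             ("stdev", statistics.pstdev), ("sum", sum)]
    return {name: f(vandal_scores) - f(regular_scores) for name, f in stats}

feats = stylometric_features([-2.0, -4.0], [-1.0, -3.0])
```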

We might even add analogous features for the parent revision.

I have begun work on a pcfg_scorer, inspired by Yu Usami's.

Note that Usami uses the CKY algorithm to parse, whereas we use the Inside algorithm to score, but the PCFG object is more or less the same in both cases.