
Add basic bad word check to Wikidata feature set
Closed, Resolved · Public

Description

What I was thinking about was checking the comments for English bad words.
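
For concreteness, here is a minimal sketch of that idea: count matches from an English bad-word list against an edit comment. The word list and function name are placeholders, not the actual revscoring badwords feature definitions.

```python
import re

# Placeholder English bad-word list; a real feature would use the much
# larger revscoring word lists, not this toy example.
BAD_WORDS = re.compile(r"\b(?:idiot|stupid|moron|crap)\b", re.IGNORECASE)

def comment_badwords(comment):
    """Count bad-word matches in an edit comment (summary)."""
    return len(BAD_WORDS.findall(comment or ""))

print(comment_badwords("you are an idiot"))               # 1
print(comment_badwords("/* wbsetlabel-add:1|en */ Foo"))  # 0
```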

Event Timeline

New model built after adding English bad words to the feature set:

ScikitLearnClassifier
 - type: GradientBoosting
 - params: balanced_sample=false, warm_start=false, loss="deviance", scale=true, min_samples_leaf=1, min_weight_fraction_leaf=0.0, max_depth=7, n_estimators=700, min_samples_split=2, random_state=null, center=true, max_leaf_nodes=null, criterion="friedman_mse", presort="auto", max_features="log2", verbose=0, subsample=1.0, learning_rate=0.01, min_impurity_split=1e-07, balanced_sample_weight=true, init=null
 - version: 0.3.0
 - trained: 2017-07-20T07:47:20.590061
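
For reference, those parameters map roughly onto scikit-learn's GradientBoostingClassifier as sketched below. Note that scale, center, balanced_sample, and balanced_sample_weight are options of revscoring's ScikitLearnClassifier wrapper rather than of scikit-learn itself, and loss="deviance", presort, and min_impurity_split belong to the 2017-era scikit-learn API, so they are left out here to keep the sketch runnable on current versions.

```python
from sklearn.ensemble import GradientBoostingClassifier

# Core hyperparameters from the report above, mapped onto scikit-learn.
model = GradientBoostingClassifier(
    n_estimators=700,
    learning_rate=0.01,
    max_depth=7,
    max_features="log2",
    criterion="friedman_mse",
    min_samples_split=2,
    min_samples_leaf=1,
    min_weight_fraction_leaf=0.0,
    subsample=1.0,
)
```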

Confusion matrix (rows: actual label, columns ~False/~True: predicted label):
	         ~False    ~True
	-----  --------  -------
	False     21084      615
	True        149     2494

Accuracy: 0.969
Precision:
	-----  -----
	False  0.993
	True   0.803
	-----  -----

Recall:
	-----  -----
	False  0.972
	True   0.943
	-----  -----
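
As a quick sanity check, the headline figures can be recomputed from the confusion matrix above; a minimal sketch of the arithmetic:

```python
# Recomputing accuracy, precision, and recall from the confusion matrix.
# The results agree with the reported figures to within about one unit in
# the third decimal place, presumably reflecting how revscoring aggregates
# its statistics.
tn, fp = 21084, 615    # actual False: predicted ~False, ~True
fn, tp = 149, 2494     # actual True:  predicted ~False, ~True

accuracy = (tn + tp) / (tn + fp + fn + tp)        # ~0.969
precision = {"False": tn / (tn + fn),             # ~0.993
             "True":  tp / (tp + fp)}             # ~0.802
recall = {"False": tn / (tn + fp),                # ~0.972
          "True":  tp / (tp + fn)}                # ~0.944

print(round(accuracy, 3), precision, recall)
```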

PR-AUC:
	-----  -----
	False  0.994
	True   0.894
	-----  -----

ROC-AUC:
	-----  -----
	False  0.986
	True   0.991
	-----  -----

Recall @ 0.1 false-positive rate:
	label      threshold    recall    fpr
	-------  -----------  --------  -----
	False          0.161     0.983  0.095
	True           0.112     0.987  0.094
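
For readers unfamiliar with these threshold statistics: "recall @ 0.1 false-positive rate" picks the score threshold that keeps the false-positive rate at or below 0.1 and reports the recall achieved there. revscoring computes this itself; the sketch below only illustrates the idea with scikit-learn and placeholder y_true / y_score arrays.

```python
from sklearn.metrics import roc_curve

def recall_at_fpr(y_true, y_score, max_fpr=0.1):
    """Highest recall (TPR) reachable while keeping FPR <= max_fpr."""
    fpr, tpr, thresholds = roc_curve(y_true, y_score)
    ok = fpr <= max_fpr
    if not ok.any():
        return None
    i = ok.nonzero()[0][tpr[ok].argmax()]
    return thresholds[i], tpr[i], fpr[i]   # (threshold, recall, fpr)
```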

Filter rate @ 0.9 recall:
	label      threshold    filter_rate    recall
	-------  -----------  -------------  --------
	False          0.9            0.196     0.9
	True           0.851          0.887     0.902

Filter rate @ 0.75 recall:
	label      threshold    filter_rate    recall
	-------  -----------  -------------  --------
	False          0.984          0.331     0.75
	True           0.965          0.908     0.752

Recall @ 0.995 precision:
	label      threshold    recall    precision
	-------  -----------  --------  -----------
	False          0.656     0.956        0.995
	True           0.987     0.049        1

Recall @ 0.99 precision:
	label      threshold    recall    precision
	-------  -----------  --------  -----------
	False          0.255     0.981         0.99
	True           0.987     0.049         1

Recall @ 0.98 precision:
	label      threshold    recall    precision
	-------  -----------  --------  -----------
	False          0.057     0.986        0.981
	True           0.986     0.09         0.994

Recall @ 0.9 precision:
	label      threshold    recall    precision
	-------  -----------  --------  -----------
	False          0.015     0.999        0.903
	True           0.954     0.561        0.902

Recall @ 0.75 precision:
	label      threshold    recall    precision
	-------  -----------  --------  -----------
	False          0.012     1            0.893
	True           0.4       0.957        0.761

Recall @ 0.6 precision:
	label      threshold    recall    precision
	-------  -----------  --------  -----------
	False          0.012     1            0.893
	True           0.16      0.982        0.616

Recall @ 0.45 precision:
	label      threshold    recall    precision
	-------  -----------  --------  -----------
	False          0.012     1            0.893
	True           0.067     0.994        0.483

Recall @ 0.15 precision:
	label      threshold    recall    precision
	-------  -----------  --------  -----------
	False          0.012         1        0.893
	True           0.009         1        0.202
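
Similarly, each "Recall @ X precision" row picks a threshold whose precision meets the target and reports the recall achieved there. A hedged sketch with placeholder y_true / y_score inputs:

```python
from sklearn.metrics import precision_recall_curve

def recall_at_precision(y_true, y_score, min_precision=0.9):
    """Best recall reachable at a threshold whose precision >= min_precision."""
    precision, recall, thresholds = precision_recall_curve(y_true, y_score)
    # precision/recall have one more entry than thresholds; drop the last point.
    candidates = [(r, t)
                  for p, r, t in zip(precision[:-1], recall[:-1], thresholds)
                  if p >= min_precision]
    return max(candidates) if candidates else None   # (recall, threshold)
```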

Improvements are small. The most significant thing I saw was a 5% increase in recall at 90% precision.

That's pretty good.