
Add entropy-related and uppercase-related measures to comments
Closed, Resolved · Public

Description

This will be extremely useful in case someone adds "LOLLL" as a term.

Event Timeline

How would this measurement look in practice?

Something like UPPERCASE_CHARS / all_chars?

It will be one of the features, but things like "longest repeated character" and "compression ratio" would be useful too.
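
Something along these lines (an illustrative sketch only, not the actual revscoring feature code; the function names are made up):

```
import math
import re
import zlib
from collections import Counter

def uppercase_ratio(text):
    """Share of characters that are uppercase, e.g. "LOLLL" -> 1.0."""
    return sum(c.isupper() for c in text) / len(text) if text else 0.0

def shannon_entropy(text):
    """Character-level Shannon entropy in bits; low for repetitive strings."""
    if not text:
        return 0.0
    counts = Counter(text)
    total = len(text)
    return -sum((n / total) * math.log2(n / total) for n in counts.values())

def longest_repeated_char(text):
    """Length of the longest run of a single repeated character."""
    return max((len(m.group(0)) for m in re.finditer(r"(.)\1*", text)), default=0)

def compression_ratio(text):
    """len(text) / len(zlib-compressed text); longer repetitive text compresses well."""
    if not text:
        return 1.0
    return len(text) / len(zlib.compress(text.encode("utf-8")))

for comment in ["LOLLL", "Fixed a typo in the infobox."]:
    print(comment,
          uppercase_ratio(comment),
          shannon_entropy(comment),
          longest_repeated_char(comment),
          compression_ratio(comment))
```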

Old model without these four features:

ScikitLearnClassifier
 - type: GradientBoosting
 - params: learning_rate=0.01, max_features="log2", max_depth=7, loss="deviance", random_state=null, criterion="friedman_mse", balanced_sample=false, presort="auto", min_impurity_split=1e-07, min_weight_fraction_leaf=0.0, scale=true, center=true, verbose=0, max_leaf_nodes=null, subsample=1.0, balanced_sample_weight=true, n_estimators=700, warm_start=false, init=null, min_samples_split=2, min_samples_leaf=1
 - version: 0.3.0
 - trained: 2017-07-18T18:08:23.587697

Table:
	         ~False    ~True
	-----  --------  -------
	False      1222      400
	True        201     2442

Accuracy: 0.859
Precision:
	-----  -----
	False  0.856
	True   0.859
	-----  -----

Recall:
	-----  -----
	False  0.751
	True   0.924
	-----  -----

PR-AUC:
	-----  -----
	False  0.849
	True   0.894
	-----  -----

ROC-AUC:
	-----  -----
	False  0.884
	True   0.885
	-----  -----

Recall @ 0.1 false-positive rate:
	label      threshold    recall    fpr
	-------  -----------  --------  -----
	False          0.416     0.767  0.096
	True           0.814     0.528  0.096

Filter rate @ 0.9 recall:
	label      threshold    filter_rate    recall
	-------  -----------  -------------  --------
	False          0.187          0.364     0.903
	True           0.597          0.354     0.902

Filter rate @ 0.75 recall:
	label      threshold    filter_rate    recall
	-------  -----------  -------------  --------
	False          0.541          0.658     0.753
	True           0.739          0.469     0.751

Recall @ 0.995 precision:
	label      threshold    recall    precision
	-------  -----------  --------  -----------
	False          0.972     0.083            1
	True           0.941     0.043            1

Recall @ 0.99 precision:
	label      threshold    recall    precision
	-------  -----------  --------  -----------
	False          0.972     0.083            1
	True           0.941     0.043            1

Recall @ 0.98 precision:
	label      threshold    recall    precision
	-------  -----------  --------  -----------
	False          0.972     0.083            1
	True           0.941     0.043            1

Recall @ 0.9 precision:
	label      threshold    recall    precision
	-------  -----------  --------  -----------
	False          0.697     0.667        0.909
	True           0.801     0.547        0.903

Recall @ 0.75 precision:
	label      threshold    recall    precision
	-------  -----------  --------  -----------
	False          0.338     0.79         0.764
	True           0.075     0.978        0.759

Recall @ 0.6 precision:
	label      threshold    recall    precision
	-------  -----------  --------  -----------
	False          0.227     0.858        0.609
	True           0.026     1            0.633

Recall @ 0.45 precision:
	label      threshold    recall    precision
	-------  -----------  --------  -----------
	False          0.147     0.949        0.457
	True           0.026     1            0.631

Recall @ 0.15 precision:
	label      threshold    recall    precision
	-------  -----------  --------  -----------
	False          0.052         1        0.386
	True           0.026         1        0.631
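
For reference, the sklearn-level parameters from the model header above correspond roughly to the configuration below. This is an illustrative reconstruction only; scale, center, balanced_sample and balanced_sample_weight appear to be options of the revscoring ScikitLearnClassifier wrapper rather than of GradientBoostingClassifier itself, so they are omitted here.

```
from sklearn.ensemble import GradientBoostingClassifier

# Illustrative reconstruction of the estimator-level params in the report.
gb = GradientBoostingClassifier(
    learning_rate=0.01,
    n_estimators=700,
    max_depth=7,
    max_features="log2",
    loss="deviance",
    criterion="friedman_mse",
    subsample=1.0,
    min_samples_split=2,
    min_samples_leaf=1,
    min_weight_fraction_leaf=0.0,
    min_impurity_split=1e-07,
    presort="auto",
    warm_start=False,
    verbose=0,
)
```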

With these four features added:

ScikitLearnClassifier
 - type: GradientBoosting
 - params: max_features="log2", min_samples_leaf=1, subsample=1.0, scale=true, balanced_sample_weight=true, verbose=0, criterion="friedman_mse", min_samples_split=2, max_leaf_nodes=null, center=true, n_estimators=700, min_weight_fraction_leaf=0.0, max_depth=7, balanced_sample=false, init=null, min_impurity_split=1e-07, warm_start=false, presort="auto", learning_rate=0.01, random_state=null, loss="deviance"
 - version: 0.3.0
 - trained: 2017-07-18T18:14:29.705559

Table:
	         ~False    ~True
	-----  --------  -------
	False      1227      395
	True        192     2451

Accuracy: 0.862
Precision:
	-----  -----
	False  0.863
	True   0.861
	-----  -----

Recall:
	-----  -----
	False  0.754
	True   0.928
	-----  -----

PR-AUC:
	-----  -----
	False  0.857
	True   0.898
	-----  -----

ROC-AUC:
	-----  -----
	False  0.888
	True   0.891
	-----  -----

Recall @ 0.1 false-positive rate:
	label      threshold    recall    fpr
	-------  -----------  --------  -----
	False          0.427     0.773  0.096
	True           0.814     0.522  0.097

Filter rate @ 0.9 recall:
	label      threshold    filter_rate    recall
	-------  -----------  -------------  --------
	False          0.186          0.36      0.903
	True           0.59           0.356     0.902

Filter rate @ 0.75 recall:
	label      threshold    filter_rate    recall
	-------  -----------  -------------  --------
	False          0.552          0.657     0.753
	True           0.746          0.475     0.751

Recall @ 0.995 precision:
	label      threshold    recall    precision
	-------  -----------  --------  -----------
	False          0.971     0.085            1
	True           0.939     0.059            1

Recall @ 0.99 precision:
	label      threshold    recall    precision
	-------  -----------  --------  -----------
	False          0.971     0.085            1
	True           0.939     0.059            1

Recall @ 0.98 precision:
	label      threshold    recall    precision
	-------  -----------  --------  -----------
	False          0.967     0.146        0.995
	True           0.939     0.059        1

Recall @ 0.9 precision:
	label      threshold    recall    precision
	-------  -----------  --------  -----------
	False          0.673     0.669        0.906
	True           0.795     0.576        0.903

Recall @ 0.75 precision:
	label      threshold    recall    precision
	-------  -----------  --------  -----------
	False          0.329     0.807        0.756
	True           0.075     0.98         0.76

Recall @ 0.6 precision:
	label      threshold    recall    precision
	-------  -----------  --------  -----------
	False          0.222     0.869        0.613
	True           0.029     0.999        0.638

Recall @ 0.45 precision:
	label      threshold    recall    precision
	-------  -----------  --------  -----------
	False          0.138     0.955        0.454
	True           0.027     1            0.632

Recall @ 0.15 precision:
	label      threshold    recall    precision
	-------  -----------  --------  -----------
	False          0.05          1        0.386
	True           0.027         1        0.632
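
The threshold statistics in these reports (e.g. "Recall @ 0.1 false-positive rate") are derived from the model's predicted probabilities. A minimal sketch of how such a number can be computed with sklearn is below; this is not the actual revscoring/editquality code, and y_true / y_score here are placeholder arrays.

```
import numpy as np
from sklearn.metrics import roc_curve

# Placeholder data: y_true are the real labels, y_score is the model's
# probability of the positive class.
rng = np.random.RandomState(0)
y_true = rng.randint(0, 2, size=1000)
y_score = np.clip(y_true * 0.6 + rng.rand(1000) * 0.6, 0, 1)

# Recall @ 0.1 false-positive rate: walk the ROC curve and take the
# highest true-positive rate whose false-positive rate stays <= 0.1.
fpr, tpr, thresholds = roc_curve(y_true, y_score)
mask = fpr <= 0.1
idx = np.argmax(tpr[mask])
print("threshold=%.3f recall=%.3f fpr=%.3f"
      % (thresholds[mask][idx], tpr[mask][idx], fpr[mask][idx]))
```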

My pip freeze, in case the numbers look off:

(p3)ladsgroup@ores-misc-01:~/editquality$ pip freeze
PyYAML==3.12
certifi==2017.4.17
chardet==3.0.4
docopt==0.6.2
docutils==0.13.1
editquality==0.4.2
idna==2.5
jsonable==0.3.1
more-itertools==3.2.0
mwapi==0.5.1
mwparserfromhell==0.4.4
mwreverts==0.0.6
mwtypes==0.2.0
mysqltsv==0.0.7
nltk==3.0.5
nose==1.3.7
numpy==1.13.1
para==0.0.5
pyenchant==1.6.9
pytz==2017.2
pywikibase==0.0.4
requests==2.18.1
revscoring==1.3.18
scikit-learn==0.18.2
scipy==0.19.1
six==1.10.0
sklearn==0.0
statistics==1.0.3.5
tabulate==0.7.7
urllib3==1.21.1
yamlconf==0.2.3

This is the new model, trained on the proper data:

ScikitLearnClassifier
 - type: GradientBoosting
 - params: balanced_sample_weight=true, verbose=0, subsample=1.0, max_depth=7, min_samples_split=2, warm_start=false, init=null, max_features="log2", balanced_sample=false, random_state=null, center=true, min_impurity_split=1e-07, scale=true, learning_rate=0.01, max_leaf_nodes=null, n_estimators=700, min_weight_fraction_leaf=0.0, min_samples_leaf=1, loss="deviance", criterion="friedman_mse", presort="auto"
 - version: 0.3.0
 - trained: 2017-07-19T10:13:13.400962

Table:
	         ~False    ~True
	-----  --------  -------
	False     21082      617
	True        149     2494

Accuracy: 0.969
Precision:
	-----  -----
	False  0.993
	True   0.802
	-----  -----

Recall:
	-----  -----
	False  0.972
	True   0.943
	-----  -----

PR-AUC:
	-----  -----
	False  0.994
	True   0.889
	-----  -----

ROC-AUC:
	-----  -----
	False  0.986
	True   0.991
	-----  -----

Recall @ 0.1 false-positive rate:
	label      threshold    recall    fpr
	-------  -----------  --------  -----
	False          0.15      0.984  0.093
	True           0.106     0.988  0.095

Filter rate @ 0.9 recall:
	label      threshold    filter_rate    recall
	-------  -----------  -------------  --------
	False          0.901          0.196     0.9
	True           0.867          0.887     0.902

Filter rate @ 0.75 recall:
	label      threshold    filter_rate    recall
	-------  -----------  -------------  --------
	False          0.985          0.331     0.75
	True           0.965          0.907     0.752

Recall @ 0.995 precision:
	label      threshold    recall    precision
	-------  -----------  --------  -----------
	False          0.667     0.954        0.995
	True           0.987     0.047        1

Recall @ 0.99 precision:
	label      threshold    recall    precision
	-------  -----------  --------  -----------
	False          0.239     0.981        0.991
	True           0.987     0.047        1

Recall @ 0.98 precision:
	label      threshold    recall    precision
	-------  -----------  --------  -----------
	False          0.055     0.986        0.981
	True           0.987     0.047        1

Recall @ 0.9 precision:
	label      threshold    recall    precision
	-------  -----------  --------  -----------
	False          0.016     0.999        0.903
	True           0.957     0.515        0.903

Recall @ 0.75 precision:
	label      threshold    recall    precision
	-------  -----------  --------  -----------
	False          0.012     1            0.894
	True           0.398     0.957        0.759

Recall @ 0.6 precision:
	label      threshold    recall    precision
	-------  -----------  --------  -----------
	False          0.012     1            0.894
	True           0.158     0.982        0.613

Recall @ 0.45 precision:
	label      threshold    recall    precision
	-------  -----------  --------  -----------
	False          0.012     1            0.894
	True           0.064     0.995        0.481

Recall @ 0.15 precision:
	label      threshold    recall    precision
	-------  -----------  --------  -----------
	False          0.012         1        0.894
	True           0.014         1        0.261
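
As a sanity check, the headline Accuracy, Precision and Recall figures follow directly from the confusion matrix above (rows are the true label, columns the prediction), up to rounding:

```
# Confusion matrix from the report above: rows = true label, columns = prediction.
tn, fp = 21082, 617   # true label False
fn, tp = 149, 2494    # true label True

accuracy = (tn + tp) / (tn + fp + fn + tp)   # 23576 / 24342
precision_true = tp / (tp + fp)              # 2494 / 3111
recall_true = tp / (tp + fn)                 # 2494 / 2643
precision_false = tn / (tn + fn)             # 21082 / 21231
recall_false = tn / (tn + fp)                # 21082 / 21699
print(accuracy, precision_true, recall_true, precision_false, recall_false)
```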

There was a bug, which I fixed.

Ladsgroup moved this task from Incoming to Done on the User-Ladsgroup board. Jul 19 2017, 6:13 PM
Halfak closed this task as Resolved. Jul 20 2017, 7:15 PM