My idea was to check for English bad words in the comments.
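A minimal sketch of that idea: count bad-word matches in an edit comment with a regex. The word list and function name here are placeholders for illustration; the real feature set would draw on a curated list such as revscoring's English badwords.

```python
import re

# Placeholder list -- a stand-in for a curated English bad-word list.
BAD_WORDS = ["stupid", "idiot", "crap"]
BAD_WORD_RE = re.compile(
    r"\b(?:" + "|".join(map(re.escape, BAD_WORDS)) + r")\b",
    re.IGNORECASE,
)

def badword_count(comment: str) -> int:
    """Count bad-word matches in an edit comment."""
    return len(BAD_WORD_RE.findall(comment))

print(badword_count("Reverted stupid vandalism"))  # → 1
print(badword_count("Added label in English"))     # → 0
```

A count like this would then be one scalar feature alongside the rest of the Wikidata feature set.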
Description
Status | Assigned | Task
---|---|---
Resolved | johl | T127047 Collection of topics for HPI hackathon
Resolved | Lydia_Pintscher | T127473 Increase signal of feature set for Wikidata model
Resolved | Halfak | T171505 Late-July 2017 ORES deploy
Resolved | Ladsgroup | T162617 Use 'informals', 'badwords', etc. in Wikidata feature set
Resolved | Ladsgroup | T170834 Add basic bad word check to Wikidata feature set
New model built after adding bad words for English:
```
ScikitLearnClassifier
 - type: GradientBoosting
 - params: balanced_sample=false, warm_start=false, loss="deviance",
   scale=true, min_samples_leaf=1, min_weight_fraction_leaf=0.0,
   max_depth=7, n_estimators=700, min_samples_split=2, random_state=null,
   center=true, max_leaf_nodes=null, criterion="friedman_mse",
   presort="auto", max_features="log2", verbose=0, subsample=1.0,
   learning_rate=0.01, min_impurity_split=1e-07,
   balanced_sample_weight=true, init=null
 - version: 0.3.0
 - trained: 2017-07-20T07:47:20.590061

Table:
         ~False    ~True
-----  --------  -------
False     21084      615
True        149     2494

Accuracy: 0.969

Precision:
-----  -----
False  0.993
True   0.803
-----  -----

Recall:
-----  -----
False  0.972
True   0.943
-----  -----

PR-AUC:
-----  -----
False  0.994
True   0.894
-----  -----

ROC-AUC:
-----  -----
False  0.986
True   0.991
-----  -----

Recall @ 0.1 false-positive rate:
label      threshold    recall    fpr
-------  -----------  --------  -----
False          0.161     0.983  0.095
True           0.112     0.987  0.094

Filter rate @ 0.9 recall:
label      threshold    filter_rate    recall
-------  -----------  -------------  --------
False          0.9              0.196     0.9
True           0.851            0.887     0.902

Filter rate @ 0.75 recall:
label      threshold    filter_rate    recall
-------  -----------  -------------  --------
False          0.984            0.331     0.75
True           0.965            0.908     0.752

Recall @ 0.995 precision:
label      threshold    recall    precision
-------  -----------  --------  -----------
False          0.656     0.956        0.995
True           0.987     0.049        1

Recall @ 0.99 precision:
label      threshold    recall    precision
-------  -----------  --------  -----------
False          0.255     0.981        0.99
True           0.987     0.049        1

Recall @ 0.98 precision:
label      threshold    recall    precision
-------  -----------  --------  -----------
False          0.057     0.986        0.981
True           0.986     0.09         0.994

Recall @ 0.9 precision:
label      threshold    recall    precision
-------  -----------  --------  -----------
False          0.015     0.999        0.903
True           0.954     0.561        0.902

Recall @ 0.75 precision:
label      threshold    recall    precision
-------  -----------  --------  -----------
False          0.012     1            0.893
True           0.4       0.957        0.761

Recall @ 0.6 precision:
label      threshold    recall    precision
-------  -----------  --------  -----------
False          0.012     1            0.893
True           0.16      0.982        0.616

Recall @ 0.45 precision:
label      threshold    recall    precision
-------  -----------  --------  -----------
False          0.012     1            0.893
True           0.067     0.994        0.483

Recall @ 0.15 precision:
label      threshold    recall    precision
-------  -----------  --------  -----------
False          0.012     1            0.893
True           0.009     1            0.202
```
Improvements are small. The most significant gain I saw was a 5% increase in recall at 90% precision.