Here are my first results.
- The number of reverted edits is very small (382), which means we are prone to overfitting.
Let me increase the number of edits to 40K and the radius to 7 revisions. We might get something.
With 40K edits:
ScikitLearnClassifier
 - type: GradientBoosting
 - params: balanced_sample=false, max_leaf_nodes=null, center=true, presort="auto", learning_rate=0.01, init=null, verbose=0, min_samples_leaf=1, n_estimators=700, max_features="log2", balanced_sample_weight=true, random_state=null, scale=true, max_depth=7, warm_start=false, loss="deviance", min_samples_split=2, subsample=1.0, min_weight_fraction_leaf=0.0
 - version: 0.0.1
 - trained: 2016-05-21T13:00:24.789069

Table:
         ~False    ~True
-----  --------  -------
False      6710     1055
True         99       97

Accuracy: 0.855
Precision: 0.084
Recall: 0.495
PR-AUC: 0.122
ROC-AUC: 0.804
Recall @ 0.1 false-positive rate: threshold=0.963, recall=0.005, fpr=0.0
Filter rate @ 0.9 recall: threshold=0.129, filter_rate=0.505, recall=0.903
Filter rate @ 0.75 recall: threshold=0.294, filter_rate=0.708, recall=0.75
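As a sanity check, the accuracy, precision and recall in the report can be re-derived from the confusion table. This is only a minimal sketch: it assumes the rows are the actual labels and the `~` columns are the predicted labels, and the variable names are mine, not part of the model output.

```python
# Counts copied from the table in the report.
# Assumption: rows = actual label, columns (~False / ~True) = predicted label.
tn, fp = 6710, 1055   # actual False: predicted False, predicted True
fn, tp = 99, 97       # actual True:  predicted False, predicted True

total = tn + fp + fn + tp

accuracy = (tn + tp) / total      # share of all edits classified correctly
precision = tp / (tp + fp)        # share of flagged edits that were really reverted
recall = tp / (tp + fn)           # share of reverted edits that were flagged
prevalence = (tp + fn) / total    # share of edits in the test set that were reverted

print(f"accuracy={accuracy:.3f} precision={precision:.3f} "
      f"recall={recall:.3f} prevalence={prevalence:.3f}")
```

Under that reading the three values match the report (0.855, 0.084, 0.495). Only about 2.5% of the test edits are reverted ones, so the high accuracy mostly reflects the majority class, and precision and recall are the more informative numbers here.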
This looks much better. Let's see if we need to do anything else.
Another thing I learned from Japanese Wikipedia is that the number of unregistered users making good edits is much higher than on other wikis. Just check out their RC. This causes features such as user age to lose their predictive value. And since we can't get much signal from Japanese text (no dict words, etc.), our models won't be as good as we want unless we add features specifically for Japanese Wikipedia/the Japanese language.
@Elitre I don't fully understand what you need. Is it something to do with the message by とある白い猫 posted to Japanese Wikipedia about "Research:Revision scoring as a service"?
Or do you need a bad word list like below?
@Miya: Hey, we are working on building anti-vandalism tools for Japanese Wikipedia using AI (for example, see ORES in beta features in Wikidata). What we need right now is someone with knowledge of the Japanese language to tell us how many of the edits linked in T133405#2322879 and marked as bad are actually bad, and how many of the edits marked as good are actually good. That way we learn about our false positives and can reduce them. Please feel free to ask if anything is unclear.
We just got pinged to re-consider this here: https://www.mediawiki.org/wiki/Topic:Ub6ir6tww9z81960
@Ladsgroup, from my skimming of the page and notes, it seems like the model is good enough to deploy. How did you generate this data?