Train a `reverted` model for jawiki
Closed, Resolved
Public

Assigned To

Authored By

	Halfak
	Mar 24 2016, 7:59 PM

Related Objects

		Status	Subtype	Assigned	Task
		Resolved		Halfak	T130869 Train a `reverted` model for jawiki
		Open		None	T133405 [research] Why is the japanese 'reverted' model so bad?

Event Timeline

Halfak created this task. Mar 24 2016, 7:59 PM

Restricted Application added a subscriber: Aklapper. Mar 24 2016, 7:59 PM

Halfak moved this task from Parked to Backlog on the Machine-Learning-Team (Active Tasks) board. Mar 24 2016, 7:59 PM

Halfak removed Halfak as the assignee of this task. Apr 5 2016, 6:05 PM

Halfak moved this task from Backlog to Parked on the Machine-Learning-Team (Active Tasks) board.

Ladsgroup claimed this task. Apr 9 2016, 5:30 PM

Ladsgroup moved this task from Parked to Backlog on the Machine-Learning-Team (Active Tasks) board.

Strangely, LogisticRegression came out best, with GradientBoosting about 1% behind. Since we don't have LR in revscoring yet, I built the GB model:

models/jawiki.reverted.gradient_boosting.model
2016-04-10 06:32:47,931 INFO:revscoring.utilities.train_test -- Training model...
2016-04-10 06:32:51,275 INFO:revscoring.utilities.train_test -- Testing model...
ScikitLearnClassifier
 - type: GradientBoosting
 - params: min_weight_fraction_leaf=0.0, presort="auto", n_estimators=700, center=true, warm_start=false, scale=true, subsample=1.0, min_samples_split=2, loss="deviance", balanced_sample_weight=true, verbose=0, min_samples_leaf=1, learning_rate=0.1, balanced_sample=false, max_depth=1, init=null, random_state=null, max_leaf_nodes=null, max_features="log2"
 - version: 0.0.1
 - trained: 2016-04-10T06:32:51.272178

Table:
                 ~False    ~True
        -----  --------  -------
        False      2894     1013
        True         17       50

Accuracy: 0.741
Precision: 0.047
Recall: 0.746
PR-AUC: 0.14
ROC-AUC: 0.782
Recall @ 0.1 false-positive rate: threshold=0.985, recall=0.015, fpr=0.0
Filter rate @ 0.9 recall: threshold=0.194, filter_rate=0.28, recall=0.91
Filter rate @ 0.75 recall: threshold=0.49, filter_rate=0.722, recall=0.761
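
As a sanity check, the headline numbers can be recomputed from the table alone. This is a minimal sketch, assuming the rows are the actual labels and the `~` columns are the predicted labels (the log does not say so explicitly, but the reported precision and recall only come out right under that reading). The function name `confusion_metrics` is hypothetical and not part of revscoring.

```python
def confusion_metrics(tn, fp, fn, tp):
    """Derive basic rates from the four cells of a 2x2 confusion table."""
    total = tn + fp + fn + tp
    return {
        "accuracy": (tp + tn) / total,
        "precision": tp / (tp + fp),
        "recall": tp / (tp + fn),
        "false_positive_rate": fp / (fp + tn),
        # Share of test edits that were actually reverted.
        "prevalence": (tp + fn) / total,
    }


# Cell values copied from the table in the log above
# (assumed layout: rows = actual, columns = predicted).
metrics = confusion_metrics(tn=2894, fp=1013, fn=17, tp=50)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

Under that reading, only 67 of the 3974 test edits are reverted, so even a modest false-positive rate produces far more false alarms (1013) than true hits (50), which is consistent with the low precision and PR-AUC despite a reasonable recall.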

The ROC-AUC (0.782) is not bad, but not very good either.

I think the reason is that we don't have a dictionary for ja.

https://github.com/wiki-ai/editquality/pull/24

Ladsgroup moved this task from Backlog to Review on the Machine-Learning-Team (Active Tasks) board. Apr 10 2016, 6:42 AM

Halfak added a subtask: T133405: [research] Why is the japanese 'reverted' model so bad?. Apr 25 2016, 4:30 PM

Moving this to the backlog. Still looking for a Japanese speaker to help us review our dataset for training/testing. @Johan, maybe you could help us with this during the workshop at Wikimania? See T134628