
Audit deployed editquality models and figure out why if the models are bad
Open, Low · Public

Description

Also, make new labeling campaigns if the existing labeled data is too old.

Related Objects

Event Timeline

Restricted Application added a subscriber: Aklapper.

I wrote a script that goes through all models, checks their ROC AUC, and sorts them from worst to best. This is the result:

wiki model roc_auc_micro
fiwiki damaging 0.827
plwiki damaging 0.841
kowiki reverted 0.874
ukwiki reverted 0.88
ruwiki wp10 0.889
frwikisource pagelevel 0.891
hewiki damaging 0.894
itwiki reverted 0.902
fiwiki goodfaith 0.902
plwiki goodfaith 0.902
frwiki damaging 0.904
tawiki reverted 0.905
dewiki reverted 0.907
trwiki wp10 0.91
elwiki reverted 0.914
cswiki damaging 0.919
frwiki wp10 0.921
hrwiki reverted 0.921
simplewiki damaging 0.924
eswiki damaging 0.924
enwiki damaging 0.924
ptwiki damaging 0.924
simplewiki goodfaith 0.925
enwiki goodfaith 0.925
ruwiki damaging 0.925
ruwiki goodfaith 0.928
bnwiki reverted 0.928
ptwiki goodfaith 0.931
frwiki goodfaith 0.933
eswiki goodfaith 0.935
arwiki damaging 0.936
sqwiki goodfaith 0.938
eswikiquote reverted 0.939
trwiki goodfaith 0.94
simplewiki wp10 0.941
trwiki damaging 0.941
enwiki wp10 0.941
huwiki damaging 0.943
iswiki reverted 0.946
sqwiki damaging 0.951
idwiki reverted 0.953
hewiki goodfaith 0.956
viwiki reverted 0.957
nlwiki damaging 0.957
rowiki damaging 0.958
rowiki goodfaith 0.959
eswikibooks damaging 0.96
fawiki goodfaith 0.961
fawiki damaging 0.962
cswiki goodfaith 0.963
etwiki damaging 0.963
nlwiki goodfaith 0.97
nowiki reverted 0.972
wikidatawiki goodfaith 0.972
wikidatawiki itemquality 0.974
cawiki damaging 0.976
svwiki damaging 0.977
svwiki goodfaith 0.977
etwiki goodfaith 0.978
lvwiki damaging 0.979
arwiki goodfaith 0.979
enwiktionary reverted 0.981
eswikibooks goodfaith 0.982
simplewiki draftquality 0.983
enwiki draftquality 0.983
wikidatawiki damaging 0.986
huwiki goodfaith 0.987
lvwiki goodfaith 0.991
cawiki goodfaith 0.992
testwiki damaging 0.996
testwiki goodfaith 0.996
testwiki reverted 0.996
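
For reference, a minimal sketch of what such an audit script might look like, assuming the public ORES v3 scores endpoint and its model_info=statistics parameter; the response layout used here is an assumption, so verify it against the live API:

```python
import requests

ORES = "https://ores.wikimedia.org/v3/scores/"

# List every context (wiki) and its deployed models. The shape assumed
# here is {wiki: {"models": {model_name: {...}}}} -- verify before use.
contexts = requests.get(ORES).json()

results = []
for wiki, info in contexts.items():
    for model in info["models"]:
        # Request the model's test statistics along with its info.
        resp = requests.get(
            ORES + wiki + "/",
            params={"models": model, "model_info": "statistics"},
        ).json()
        stats = resp[wiki]["models"][model].get("statistics", {})
        roc = stats.get("roc_auc", {})
        if "micro" in roc:
            results.append((wiki, model, roc["micro"]))

# Sort worst to best, like the table above, so the weakest models
# (fiwiki, plwiki, hewiki damaging) surface first.
for wiki, model, auc in sorted(results, key=lambda row: row[2]):
    print(wiki, model, round(auc, 3))
```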

Now it's time to think about what we should do with the fiwiki, plwiki, and hewiki damaging models :/

@Zache, @eranroz, and @Wargo, how are ORES damage detection models working on your wikis? Our stats suggest they are not very accurate, but we want to know about your experiences with them.

On hewiki the ORES damaging model works OK, but there is still room for improvement.
I think it is somewhat too conservative (by too conservative I mean false negatives: damaging edits that get a low probability of ~0.4-0.5).

Currently both the goodfaith and damaging models give scores weighted toward the good end. For example, goodfaith is pretty much always 0.95 or better. The same is true of damaging, though its weighting seems not to be as bad.

However, the weighting seems to be systematic, and if we ignore it, the damaging model is currently better at detecting actual damage than it was in summer 2017, when it was more biased against IP editors. The damaging model is currently also better than the goodfaith one.

As for practical use, my seulojabot currently approves popular-culture edits if ORES goodfaith true is >0.95 and damaging true is <0.15, and non-BLP edits if goodfaith true is >0.99 and damaging true is <0.015; those limits seem to be unproblematic.
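
As an illustration (the helper names here are hypothetical, not seulojabot's actual code, and the assumed ORES v3 response layout should be checked against the API), such a threshold gate could look like:

```python
import requests

def ores_probabilities(wiki, revid):
    """Fetch the goodfaith/damaging 'true' probabilities for a revision."""
    url = f"https://ores.wikimedia.org/v3/scores/{wiki}/"
    resp = requests.get(url, params={"models": "goodfaith|damaging",
                                     "revids": revid}).json()
    scores = resp[wiki]["scores"][str(revid)]
    return (scores["goodfaith"]["score"]["probability"]["true"],
            scores["damaging"]["score"]["probability"]["true"])

def may_auto_approve(wiki, revid, popular_culture=False):
    """Apply the limits quoted above: looser for popular-culture edits,
    stricter for other non-BLP edits."""
    goodfaith, damaging = ores_probabilities(wiki, revid)
    if popular_culture:
        return goodfaith > 0.95 and damaging < 0.15
    return goodfaith > 0.99 and damaging < 0.015
```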

Ping @4shadoww, how about stabilizerbot stats?

I think the models have improved since summer 2017. In my experience the damaging model produces few if any false positives, which I think is a good thing. On the other hand, it produces a lot of false negatives, so it doesn't detect damaging edits reliably. As far as I know, the goodfaith model is just not very accurate, as it makes both false positives and false negatives.

Stabilizerbot's mistakes mostly come from its other methods of detecting harmful edits, not from ORES. ORES itself seems to produce false positives pretty rarely, as the bot requires the damaging score to be true < 0.15 and false > 0.825, and the goodfaith score to be false < 0.15 and true > 0.825.
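
Spelled out as code (hypothetical names again, not the bot's real source), the ORES part of that gate checks both sides of each model's probability:

```python
def ores_says_safe(scores):
    """Stabilizerbot-style check. `scores` is the per-revision dict from
    the ORES v3 scores endpoint (assumed layout)."""
    damaging = scores["damaging"]["score"]["probability"]
    goodfaith = scores["goodfaith"]["score"]["probability"]
    return (damaging["true"] < 0.15 and damaging["false"] > 0.825
            and goodfaith["false"] < 0.15 and goodfaith["true"] > 0.825)
```

Since p(false) = 1 − p(true) for these binary models, the second condition in each pair is implied by the first; keeping both simply documents the intended margins.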

Vvjjkkii renamed this task from Audit deployed editquality models and figure out why if the models are bad to txcaaaaaaa. · Jul 1 2018, 1:10 AM
Vvjjkkii removed Ladsgroup as the assignee of this task.
Vvjjkkii triaged this task as High priority.
Vvjjkkii updated the task description.
Vvjjkkii removed a subscriber: Aklapper.
Wargo renamed this task from txcaaaaaaa to Audit deployed editquality models and figure out why if the models are bad. · Jul 1 2018, 9:31 AM
Wargo assigned this task to Ladsgroup.
Wargo raised the priority of this task from High to Needs Triage.
Wargo updated the task description.
Wargo added a subscriber: Aklapper.
Ladsgroup raised the priority of this task from Low to Needs Triage.
Ladsgroup moved this task from Unsorted to Maintenance/cleanup on the Machine-Learning-Team board.