
Audit deployed editquality models and figure out why if the models are bad
Open, Low · Public

Description

Also, make new labeling campaigns if the existing labeled data is too old.

Related Objects

Event Timeline

Restricted Application added a subscriber: Aklapper.

I wrote a script that goes through all models, checks their ROC AUC, and sorts them from worst to best. This is the result:

wiki model roc_auc_micro
fiwiki damaging 0.827
plwiki damaging 0.841
kowiki reverted 0.874
ukwiki reverted 0.88
ruwiki wp10 0.889
frwikisource pagelevel 0.891
hewiki damaging 0.894
itwiki reverted 0.902
fiwiki goodfaith 0.902
plwiki goodfaith 0.902
frwiki damaging 0.904
tawiki reverted 0.905
dewiki reverted 0.907
trwiki wp10 0.91
elwiki reverted 0.914
cswiki damaging 0.919
frwiki wp10 0.921
hrwiki reverted 0.921
simplewiki damaging 0.924
eswiki damaging 0.924
enwiki damaging 0.924
ptwiki damaging 0.924
simplewiki goodfaith 0.925
enwiki goodfaith 0.925
ruwiki damaging 0.925
ruwiki goodfaith 0.928
bnwiki reverted 0.928
ptwiki goodfaith 0.931
frwiki goodfaith 0.933
eswiki goodfaith 0.935
arwiki damaging 0.936
sqwiki goodfaith 0.938
eswikiquote reverted 0.939
trwiki goodfaith 0.94
simplewiki wp10 0.941
trwiki damaging 0.941
enwiki wp10 0.941
huwiki damaging 0.943
iswiki reverted 0.946
sqwiki damaging 0.951
idwiki reverted 0.953
hewiki goodfaith 0.956
viwiki reverted 0.957
nlwiki damaging 0.957
rowiki damaging 0.958
rowiki goodfaith 0.959
eswikibooks damaging 0.96
fawiki goodfaith 0.961
fawiki damaging 0.962
cswiki goodfaith 0.963
etwiki damaging 0.963
nlwiki goodfaith 0.97
nowiki reverted 0.972
wikidatawiki goodfaith 0.972
wikidatawiki itemquality 0.974
cawiki damaging 0.976
svwiki damaging 0.977
svwiki goodfaith 0.977
etwiki goodfaith 0.978
lvwiki damaging 0.979
arwiki goodfaith 0.979
enwiktionary reverted 0.981
eswikibooks goodfaith 0.982
simplewiki draftquality 0.983
enwiki draftquality 0.983
wikidatawiki damaging 0.986
huwiki goodfaith 0.987
lvwiki goodfaith 0.991
cawiki goodfaith 0.992
testwiki damaging 0.996
testwiki goodfaith 0.996
testwiki reverted 0.996
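
For reference, a minimal sketch of what such an audit script might look like, assuming the public ORES v3 scores endpoint and its model_info=statistics parameter; the response layout used here is an assumption, so verify it against the live API:

```python
import requests

ORES = "https://ores.wikimedia.org/v3/scores/"

# List every context (wiki) and its deployed models. The shape assumed
# here is {wiki: {"models": {model_name: {...}}}} -- verify before use.
contexts = requests.get(ORES).json()

results = []
for wiki, info in contexts.items():
    for model in info["models"]:
        # Request the model's test statistics along with its info.
        resp = requests.get(
            ORES + wiki + "/",
            params={"models": model, "model_info": "statistics"},
        ).json()
        stats = resp[wiki]["models"][model].get("statistics", {})
        roc = stats.get("roc_auc", {})
        if "micro" in roc:
            results.append((wiki, model, roc["micro"]))

# Sort worst to best, like the table above, so the weakest models
# (fiwiki, plwiki, hewiki damaging) surface first.
for wiki, model, auc in sorted(results, key=lambda row: row[2]):
    print(wiki, model, round(auc, 3))
```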

Now it's time to think about what we should do with the fiwiki, plwiki, and hewiki damaging models :/

@Zache, @eranroz, and @Wargo, how are ORES damage detection models working on your wikis? Our stats suggest they are not very accurate, but we want to know about your experiences with them.

On hewiki the ORES damaging model works OK, but there is still room for improvement.
I think it is somewhat too conservative (by too conservative I mean false negatives: damaging edits that get a low probability of ~0.4-0.5).

Currently both the goodfaith and damaging models give scores weighted toward the good end. For example, goodfaith is pretty much always 0.95 or better. The same is true of damaging, though its weighting seems not to be as bad.

However, the weighting seems to be systematic, and if we ignore it, the damaging model is currently better at detecting actual damage than it was in summer 2017, when it was more biased against IP editors. The damaging model is currently also better than the goodfaith one.

As for practical use, my seulojabot currently approves popular-culture edits if ORES goodfaith true is >0.95 and damaging true is <0.15, and non-BLP edits if goodfaith true is >0.99 and damaging true is <0.015; those limits seem to be unproblematic.
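
As an illustration (the helper names here are hypothetical, not seulojabot's actual code, and the assumed ORES v3 response layout should be checked against the API), such a threshold gate could look like:

```python
import requests

def ores_probabilities(wiki, revid):
    """Fetch the goodfaith/damaging 'true' probabilities for a revision."""
    url = f"https://ores.wikimedia.org/v3/scores/{wiki}/"
    resp = requests.get(url, params={"models": "goodfaith|damaging",
                                     "revids": revid}).json()
    scores = resp[wiki]["scores"][str(revid)]
    return (scores["goodfaith"]["score"]["probability"]["true"],
            scores["damaging"]["score"]["probability"]["true"])

def may_auto_approve(wiki, revid, popular_culture=False):
    """Apply the limits quoted above: looser for popular-culture edits,
    stricter for other non-BLP edits."""
    goodfaith, damaging = ores_probabilities(wiki, revid)
    if popular_culture:
        return goodfaith > 0.95 and damaging < 0.15
    return goodfaith > 0.99 and damaging < 0.015
```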

Ping @4shadoww, how about stabilizerbot stats?

I think the models have improved since summer 2017. In my experience the damaging model produces few if any false positives, which I think is a good thing. On the other hand, it produces a lot of false negatives, so it doesn't detect damaging edits reliably. As far as I know, the goodfaith model is just not very accurate, as it makes both false positives and false negatives.

Stabilizerbot's mistakes mostly come from its other methods of detecting harmful edits, not from ORES. ORES itself seems to produce false positives pretty rarely, as the bot requires the damaging score to be true < 0.15 and false > 0.825, and the goodfaith score to be false < 0.15 and true > 0.825.
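
Spelled out as code (hypothetical names again, not the bot's real source), the ORES part of that gate checks both sides of each model's probability:

```python
def ores_says_safe(scores):
    """Stabilizerbot-style check. `scores` is the per-revision dict from
    the ORES v3 scores endpoint (assumed layout)."""
    damaging = scores["damaging"]["score"]["probability"]
    goodfaith = scores["goodfaith"]["score"]["probability"]
    return (damaging["true"] < 0.15 and damaging["false"] > 0.825
            and goodfaith["false"] < 0.15 and goodfaith["true"] > 0.825)
```

Since p(false) = 1 − p(true) for these binary models, the second condition in each pair is implied by the first; keeping both simply documents the intended margins.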

Vvjjkkii renamed this task from Audit deployed editquality models and figure out why if the models are bad to txcaaaaaaa. · Jul 1 2018, 1:10 AM
Vvjjkkii removed Ladsgroup as the assignee of this task.
Vvjjkkii triaged this task as High priority.
Vvjjkkii updated the task description.
Vvjjkkii removed a subscriber: Aklapper.
Wargo renamed this task from txcaaaaaaa to Audit deployed editquality models and figure out why if the models are bad. · Jul 1 2018, 9:31 AM
Wargo assigned this task to Ladsgroup.
Wargo raised the priority of this task from High to Needs Triage.
Wargo updated the task description.
Wargo added a subscriber: Aklapper.
Ladsgroup raised the priority of this task from Low to Needs Triage.
Ladsgroup moved this task from Unsorted to Maintenance/cleanup on the Machine-Learning-Team board.