
arwiki goodfaith model is not usable
Open, High, Public

Description

Looking at https://ores.wikimedia.org/v3/scores/arwiki?models=goodfaith&model_info=statistics.thresholds.false , I see that the precision never exceeds 6% or so, and except for the 0.000 and 0.001 thresholds, recall does not exceed 9%.
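
For anyone re-checking this, the per-threshold statistics can be pulled straight from that endpoint; a minimal sketch, assuming the usual v3 response layout (wiki → models → goodfaith → statistics):

# List precision and recall at each scoring threshold for the goodfaith=false class
curl -s 'https://ores.wikimedia.org/v3/scores/arwiki?models=goodfaith&model_info=statistics.thresholds.false' \
  | jq '.arwiki.models.goodfaith.statistics.thresholds["false"][] | {threshold, precision, recall}'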

Event Timeline

Restricted Application added subscribers: Ahmed123, Aklapper.

For context, when setting thresholds for use in RCFilters, we typically look for:

  • Maybe good faith: 90% recall or 15% precision, whichever is better
  • Likely good faith: 60% precision
  • Very likely good faith: 90% precision

We sometimes tweak these depending on the model characteristics, and sometimes drop one of these three if it doesn't make sense to have it, but with this model there isn't a single threshold setting that even comes close to 15% precision, let alone the others.
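
For reference, ORES can also be asked for the best operating point directly via its threshold-optimization queries; a sketch, assuming the `"maximum recall @ precision >= ..."` syntax that model_info accepts (the expression must be URL-encoded, which curl handles here):

# Ask ORES for the highest-recall threshold that still reaches 15% precision
curl -sG 'https://ores.wikimedia.org/v3/scores/arwiki' \
  --data-urlencode 'models=goodfaith' \
  --data-urlencode 'model_info=statistics.thresholds.false."maximum recall @ precision >= 0.15"'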

I talked to @Ladsgroup about this at the hackathon. I think we'll need an Arabic speaker/reader to help us review a sample of the good-faith labels we got from the campaign. It could be that there was a misunderstanding about the meaning of this label. If so, we can probably find out which labeler was getting things wrong and get those observations re-labeled.

It could also be that goodfaith is hard to distinguish in arwiki, but that seems unlikely given that our damaging model is working OK.

Okay, I have been checking this; I know a little bit of Arabic (long story). First, most of the labels came from a single user. That's not good, and it makes the model more subjective than objective. Also, @Halfak labeled 96 edits there.

What stands out to me is that the number of edits marked "goodfaith and damaging" (= honest mistakes) is extremely high:

jq '.tasks[] | select(.labels[].data.goodfaith == true and .labels[].data.damaging == true) | .data.rev_id' arwiki.json | wc -l
266

Compared to "not goodfaith and damaging" (= vandalism):

amsa@C235:~/workspace$ jq '.tasks[] | select(.labels[].data.goodfaith == false and .labels[].data.damaging == true) | .data.rev_id' arwiki.json | wc -l
89
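
The full breakdown of the four label combinations can be produced in one pass; a sketch, assuming each task's first label carries both booleans (the select-based commands above iterate over all labels, which can double-count multi-label tasks):

# Count each (goodfaith, damaging) combination across all labeled tasks
jq -r '.tasks[] | .labels[0].data | "goodfaith=\(.goodfaith) damaging=\(.damaging)"' arwiki.json | sort | uniq -c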

Even funnier, we have 12 cases of "not goodfaith and not damaging" (I don't know what name to put on this). I kept the list in a subpage. What I can do is remove the data in these three cases and ask another native Arabic speaker to label them again. It won't be more than 400 cases, so it's easily doable, and I know some community members. What do you think?

Hello all,
Is there any way I can help here?

Hey @alanajjar. Thanks for showing interest! It seems ORES has some issues with Arabic Wikipedia. Can you re-review some edits to improve the model? If yes, do you know how to label edits for ORES? Thanks again!

Please explain in more detail how to do that, because as you can see in T131669, @Ghassanmas is the most active on this front!

Also, if you list point by point what is needed from the ar.wiki community, I'll try to bring a few active, highly experienced users to help with that.

Hello All,

While labeling the samples, I have always wondered how the model could detect good faith, given that the ratio of bad edits in each campaign is about 1-4 per 50 samples, and bad-faith edits make up even less than those 1-4. I have expressed my opinion about that and was told the ratio is normal.

In order to label a sample as bad faith, I usually labeled according to the following:

  • When the edit contains swear words.
  • The editor is expressing his/her own point of view.
  • Vandalism: where it seems the editor is expressing negative emotions about the content of the article.

These measures also follow the guidance I was given: assume the editor has good intentions whenever it's unclear whether an edit is good faith or not. Again, the probability of an edit being ambiguous in terms of good faith is less than 1 in 100 samples.

What features are used to detect bad faith, besides features that count words and detect swear words? Because if we are only using word frequencies, I don't expect the goodfaith model to be good...
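
(As a side note, the model's configuration can be inspected from the same API; a sketch, assuming this deployment exposes `model_info=params`. I believe the actual feature definitions live in the editquality repository's per-wiki feature lists:)

# Dump the arwiki goodfaith model's training parameters
curl -s 'https://ores.wikimedia.org/v3/scores/arwiki?models=goodfaith&model_info=params' \
  | jq '.arwiki.models.goodfaith.params'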

Everything that falls under vandalism should be marked as "badfaith and damaging"; other types (e.g. "the editor is expressing his/her own point of view") should be considered "goodfaith but damaging". Examples would make this clearer. Do you have some at hand?

Regarding expressing opinions, it wasn't always the case that I would label it as "bad faith damaging", since it's common for editors to express their opinions in a respectful manner (as in a conversational topic); in that case I would label it, as you said, "goodfaith and damaging". But whenever the editor expresses their opinion in a very harsh manner, or their opinion is based purely on a political agenda, I would label it as bad faith.
Again, I guess the features play a crucial role in classification; also, since Arabic is close to Persian, you may have some insights about that.

Some bad faith examples:

| RevID | Label | Reason |
|-------|-------|--------|
| https://ar.wikipedia.org/w/index.php?diff=14922859 | Bad faith / Damaging | Swear words |
| https://ar.wikipedia.org/w/index.php?diff=15480032 | Bad faith / Damaging | Forcing opinion / political agenda |
| https://ar.wikipedia.org/w/index.php?diff=14679898 | Bad faith / Damaging | Vandalism |
| https://ar.wikipedia.org/w/index.php?diff=16863444 | Good faith / Not damaging | Editor added content, but the content is not organized |
| http://ar.wikipedia.org/w/index.php?diff=15820851 | Bad faith / Damaging | Forcing opinion / political agenda |

The bad-faith examples were gathered by re-scanning more than 200 already-labeled edits; these were all the "bad faith damaging" cases I could find among the samples I re-scanned. For "good faith / not damaging" there are more; I included only one to give a sense of how I was labeling...

Lastly, is the ratio of bad-faith edits enough to create an accurate classifier?
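
For reference, the base rate can be read off the same arwiki.json dump used earlier in this task; a sketch, again taking each task's first label:

# Proportion of "bad faith and damaging" labels in the labeled data
jq '[.tasks[] | .labels[0].data]
    | {total: length, badfaith_damaging: map(select(.goodfaith == false and .damaging == true)) | length}' arwiki.json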

CommunityTechBot raised the priority of this task from High to Needs Triage. · Jul 5 2018, 6:37 PM
Halfak triaged this task as High priority. · Feb 15 2019, 7:59 PM
Halfak lowered the priority of this task from High to Low.
Halfak raised the priority of this task from Low to High.

@calbon, any update on this project? I am happy to help if needed.

@Ghassanmas Huge thanks for the offer to help. To give context, we are planning to retrain all the models from new data in the future, which would resolve this particular task.

Sounds great! I may be able to support you in terms of labeling data. I am currently building a risk communication platform for UNICEF; one of the things they want is the ability to classify community inquiries, e.g. fake news/rumor, question, or general feedback, etc. We are trying to create an Arabic dataset in order to be able to build such a model.

To give you more context, we are getting the inquiries from the Palestinian community, and they are written in colloquial Arabic (Ammiyah), not the formal Fusha.
What are you planning to use to build the model? e.g. word counts, word2vec, attention models, etc.

I'd say this is an opportunity for us to collaborate; maybe we could give you access to human resources for labeling data, or share our dataset with you.

You can learn more about my agency at https://zaat.dev