Lack of intersection between damaging & goodfaith for English Wikipedia
Closed, Resolved, Public

Description

The team working on Edit-Review-Improvements is hoping to support a process by which patrollers look for damaging edits by good-faith new editors.

Currently, they are hoping to find edits that are "likely" to be damaging (operationalized as recall_at_precision(min_precision=0.6) == 0.879) and "very likely" to be goodfaith (operationalized as recall_at_precision(min_precision=0.995) == 0.86), but they aren't finding any in practice.
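For readers unfamiliar with the notation, here is a minimal sketch of one common reading of a "recall at precision" statistic: among all score cutoffs whose precision meets the minimum, pick the one with the highest recall. The function below, its signature, and the toy data are illustrative assumptions; the thread does not show how ORES computes the statistic.

```python
from typing import List, Optional, Tuple


def recall_at_precision(
    scores: List[float], labels: List[bool], min_precision: float
) -> Optional[Tuple[float, float]]:
    """Return (threshold, recall) for the cutoff with the highest recall
    whose precision is at least min_precision, or None if no cutoff qualifies.

    Hypothetical helper for illustration; not the ORES implementation.
    """
    total_positive = sum(labels)
    if total_positive == 0:
        return None
    # Walk from the highest score down, treating each distinct score as a cutoff.
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    best = None
    true_positives = 0
    for i, (score, label) in enumerate(ranked, start=1):
        true_positives += label
        # Skip until the last item of a run of tied scores.
        if i < len(ranked) and ranked[i][0] == score:
            continue
        precision = true_positives / i
        recall = true_positives / total_positive
        if precision >= min_precision and (best is None or recall > best[1]):
            best = (score, recall)
    return best


# Toy data (made up): model scores and hand labels for six edits.
toy_scores = [0.95, 0.9, 0.8, 0.7, 0.6, 0.4]
toy_labels = [True, True, False, True, False, False]
loose = recall_at_precision(toy_scores, toy_labels, min_precision=0.6)
strict = recall_at_precision(toy_scores, toy_labels, min_precision=0.995)
```

On the toy data, the stricter precision requirement pushes the cutoff up and recall down, which is the trade-off at issue in this task.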

This task is done when we have explored the implications of these operationalizations. Is it a failure of the prediction models that we can't find these edits? Or is it an improper operationalization?

Related Objects

		Status	Subtype	Assigned	Task
		Resolved		• jmatazzoni	T163757 ORES on en.wiki chokes on "V. likely good faith" + "Likely have problems"
		Resolved		Halfak	T163995 Lack of intersection between damaging & goodfaith for English Wikipedia

Event Timeline

Halfak created this task. Apr 27 2017, 2:02 PM

Restricted Application added a project: Collaboration-Team-Triage. Apr 27 2017, 2:02 PM

Halfak lowered the priority of this task from High to Medium. Apr 27 2017, 2:38 PM

Halfak moved this task from Unsorted to New development on the Machine-Learning-Team board.

Halfak claimed this task. May 3 2017, 10:27 PM

Halfak edited projects, added Machine-Learning-Team (Active Tasks); removed Machine-Learning-Team.

I plan to examine this tomorrow. My general strategy will be this:

1. Gather a random sample of revisions from Wikipedia (using recentchanges).
2. Score all revisions in the sample and obtain the damaging.true and goodfaith.true scores.
3. Plot intersections between scores. Spot check a small sample of likely damaging and likely good-faith scores.
4. Report and make recommendations.
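The intersection step of this plan can be sketched as follows. This is only an illustration: the revision IDs, scores, and threshold values are made up, and the dictionary layout is an assumed simplification of whatever the scoring service returns.

```python
from typing import Dict, List


def intersect(
    scored: List[Dict[str, float]], damaging_min: float, goodfaith_min: float
) -> List[Dict[str, float]]:
    """Keep revisions whose damaging.true and goodfaith.true scores
    both meet their respective minimum thresholds."""
    return [
        rev
        for rev in scored
        if rev["damaging"] >= damaging_min and rev["goodfaith"] >= goodfaith_min
    ]


# Hypothetical sample: the two scores tend to move in opposite directions.
sample = [
    {"rev_id": 1, "damaging": 0.92, "goodfaith": 0.10},
    {"rev_id": 2, "damaging": 0.61, "goodfaith": 0.83},
    {"rev_id": 3, "damaging": 0.05, "goodfaith": 0.99},
    {"rev_id": 4, "damaging": 0.55, "goodfaith": 0.91},
]

# Hypothetical thresholds: a strict pair and a relaxed pair.
strict_hits = intersect(sample, damaging_min=0.8, goodfaith_min=0.9)
relaxed_hits = intersect(sample, damaging_min=0.5, goodfaith_min=0.8)
```

With data like this, the strict pair of thresholds selects nothing while the relaxed pair selects revisions 2 and 4, which mirrors the symptom described in the task.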

My analysis: https://meta.wikimedia.org/wiki/Research_talk:Automated_classification_of_edit_quality/Work_log/2017-05-04

My recommendation: Don't set such strict thresholds. Models will still be useful at lower levels of confidence.

Thanks, @Halfak. Since we already have a lower-probability Quality filter ("May have problems"), I assume you are suggesting we lower the "Very likely good faith" filter threshold somewhat? (Just to point out: it's already possible to find intersections between "May have problems" and "Very likely good faith." The problem is that the lower probability of the "May" filter is not ideal, so it would be nice to be able to use "Likely have problems.")

In terms of the thresholds reflected in this spreadsheet, what settings would you recommend?

Halfak moved this task from Parked to Completed on the Machine-Learning-Team (Active Tasks) board. May 4 2017, 9:29 PM

Halfak mentioned this in T164547: Implement score_revisions utility. May 4 2017, 10:44 PM

@jmatazzoni, I recommend the thresholds I used in the analysis: filter_rate_at_recall(min_recall=0.75) for damaging and recall_at_precision(min_precision=0.99) for goodfaith. Those gave me useful results.
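As a rough illustration of a recall-based threshold, the sketch below picks the highest cutoff that still reaches the minimum recall and reports the filter rate, taken here to mean the share of edits that fall below the cutoff and so need no review. That definition, the function itself, and the toy data are assumptions for illustration, not the ORES implementation.

```python
from typing import List, Optional, Tuple


def filter_rate_at_recall(
    scores: List[float], labels: List[bool], min_recall: float
) -> Optional[Tuple[float, float]]:
    """Return (threshold, filter_rate) for the highest cutoff whose recall
    is at least min_recall, or None if no cutoff qualifies.

    Hypothetical helper; filter_rate is assumed to be the fraction of
    items scoring below the cutoff.
    """
    total_positive = sum(labels)
    if total_positive == 0:
        return None
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    true_positives = 0
    for i, (score, label) in enumerate(ranked, start=1):
        true_positives += label
        # Skip until the last item of a run of tied scores.
        if i < len(ranked) and ranked[i][0] == score:
            continue
        if true_positives / total_positive >= min_recall:
            # The first qualifying cutoff flags the fewest items.
            return score, 1 - i / len(ranked)
    return None


# Toy data (made up): model scores and hand labels for six edits.
toy_scores = [0.95, 0.9, 0.8, 0.7, 0.6, 0.4]
toy_labels = [True, True, False, True, False, False]
result = filter_rate_at_recall(toy_scores, toy_labels, min_recall=0.75)
```

On the toy data, reaching 75% recall requires reviewing the top four of six edits, so one third of the edits are filtered out.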

Catrope mentioned this in T161655: Damaging levels on Polish Wikipedia overlap too much. May 4 2017, 11:02 PM

• jmatazzoni added a comment. May 5 2017, 10:07 PM

This comment was removed by • jmatazzoni.

• jmatazzoni mentioned this in T164621: Adjust ORES levels on en.wiki to get better overlap between good faith and damage. May 5 2017, 10:14 PM

@Halfak, thanks for the recommendations. Let me take them in turn to make sure I understand:

Move goodfaith to .99.

Good idea. I see no downside. I'll put in a ticket to make this happen.

Move "damaging" to .75.

I didn't know we had a recall=.75 setting. I don't think this would work for our "Likely" filter, since it would provide precision of, what, something like 20%? That's not what most people mean by "likely," and it doesn't provide much differentiation from the "May" filter levels.

Why don't we try moving "Likely have problems" to precision=.45, together with the other change, and see how that goes? What do you think?

@jmatazzoni, I don't think that precision-focused thresholds match the use-case of damage patrolling -- whether it is focused on good-faith damage or vandalism. I believe that basing thresholds on recall is how patrollers would rather direct their work. After all, a prediction with high precision and low recall would not allow a patroller to catch "most of the damage". I think the naming scheme for the thresholds used in ERI is unfortunate in this regard and I'm not sure what direction I'd recommend from a product perspective. If you'll recall, I brought up this concern several times during the initial designs of the ERI filters.

In my qualitative analysis of edits that met this threshold, I found that 19 out of 30 were some version of damaging. Using a chi^2 statistic, we can say with 95% confidence that the precision at that threshold is between 43.9% and 79.5% for a sample of recent, non-bot edits (last 30 days at the time I ran the query).
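The quoted interval is consistent with a Wilson score interval with continuity correction, which is the interval R's prop.test reports alongside its chi-squared test. The thread does not say which tool was used, so treat that as an assumption; the sketch below simply reproduces the arithmetic for 19 out of 30 using the Python standard library.

```python
from math import sqrt
from statistics import NormalDist
from typing import Tuple


def wilson_cc_interval(
    successes: int, n: int, confidence: float = 0.95
) -> Tuple[float, float]:
    """Wilson score interval with continuity correction for a proportion.

    Intended for the usual confidence levels (e.g. 0.95); assumed, not
    confirmed, to be the method behind the figures in the comment above.
    """
    if n <= 0:
        raise ValueError("n must be positive")
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    p = successes / n
    q = 1 - p
    denom = 2 * (n + z * z)
    lower = (
        2 * n * p + z * z - 1
        - z * sqrt(z * z - 2 - 1 / n + 4 * p * (n * q + 1))
    ) / denom
    upper = (
        2 * n * p + z * z + 1
        + z * sqrt(z * z + 2 - 1 / n + 4 * p * (n * q - 1))
    ) / denom
    # Clamp to the valid range for a proportion.
    return max(0.0, lower), min(1.0, upper)


# 19 of the 30 spot-checked edits were judged damaging.
low, high = wilson_cc_interval(19, 30)
print(f"{low:.3f} to {high:.3f}")
```

Rounded to one decimal place as percentages, this gives 43.9% and 79.5%, the bounds quoted in the comment.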

• jmatazzoni edited projects, added Collaboration-Team-Triage (Collab-Team-Q4-Apr-Jun-2017); removed Collaboration-Team-Triage. May 9 2017, 8:41 PM

• jmatazzoni moved this task from Untriaged to Product/Design Work on the Collaboration-Team-Triage (Collab-Team-Q4-Apr-Jun-2017) board.

We are getting intersection now, so I'm closing this ticket. But we'll keep an eye on the issue of whether the results from "V. likely good faith" + "Likely problems" are useful to users.

• jmatazzoni closed this task as Resolved. May 9 2017, 10:09 PM

Halfak mentioned this in Blog Post: Status update (June 3rd, 2017). Jun 3 2017, 8:24 PM
