
Lack of intersection between damaging & goodfaith for English Wikipedia
Closed, Resolved, Public

Description

The team working on Edit-Review-Improvements is hoping to support a process by which patrollers look for damaging edits by good-faith new editors.

Currently, they are hoping to find edits that are "likely" to be damaging (operationalized as recall_at_precision(min_precision=0.6) == 0.879) and "very likely" to be goodfaith (operationalized as recall_at_precision(min_precision=0.995) == 0.86), but they aren't finding any in practice.
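To make these operationalizations concrete, here is a minimal sketch of how a threshold expression such as recall_at_precision(min_precision=0.6) can be read: among all score thresholds whose precision is at least min_precision, take the one with the highest recall. This reading is an assumption, the function below is a hypothetical re-implementation rather than the scoring service's own code, and the labelled sample is invented for illustration.

```python
def recall_at_precision(scored, min_precision):
    """Return (threshold, precision, recall) with the highest recall among
    thresholds whose precision is at least min_precision, or None if no
    threshold qualifies. `scored` is a list of (score, is_positive) pairs."""
    positives = sum(1 for _, label in scored if label)
    best = None
    for threshold in sorted({score for score, _ in scored}):
        # Everything scoring at or above the threshold is flagged.
        flagged = [label for score, label in scored if score >= threshold]
        true_positives = sum(flagged)
        precision = true_positives / len(flagged)
        recall = true_positives / positives
        if precision >= min_precision and (best is None or recall > best[2]):
            best = (threshold, precision, recall)
    return best


# Hypothetical (score, is_damaging) pairs, invented for illustration only.
sample = [(0.95, True), (0.90, True), (0.80, False), (0.70, True),
          (0.60, False), (0.40, True), (0.30, False), (0.10, False)]

likely = recall_at_precision(sample, min_precision=0.6)
very_likely = recall_at_precision(sample, min_precision=0.995)
print(likely, very_likely)
```

On this toy sample the stricter precision requirement pushes the threshold up from 0.40 to 0.90 and halves the recall, which is the kind of trade-off the question above is about.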

This task is done when we have explored the implications of these operationalizations. Is it a failure of the prediction models that we can't find these edits? Or is it an improper operationalization?

Event Timeline

Halfak created this task. Apr 27 2017, 2:02 PM
Restricted Application added a project: Collaboration-Team-Triage. Apr 27 2017, 2:02 PM
Halfak lowered the priority of this task from High to Medium. Apr 27 2017, 2:38 PM
Halfak moved this task from Untriaged to New development on the Scoring-platform-team board.

I plan to examine this tomorrow. My general strategy will be this:

  1. Gather random sample of revisions from Wikipedia (using recentchanges)
  2. Score all revisions in the sample and obtain the damaging.true and goodfaith.true scores
  3. Plot intersections between scores. Spot check a small sample of likely damaging and likely good-faith scores
  4. Report and make recommendations.
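
Steps 2 and 3 can be sketched roughly as follows. This is only an illustration: the revision IDs, scores, and cut-offs are all invented, and the `edits_in_intersection` helper is hypothetical. It assumes the scores for each revision have already been fetched into a dict.

```python
def edits_in_intersection(scores, damaging_threshold, goodfaith_threshold):
    """Return revision IDs whose damaging and goodfaith scores both meet
    their thresholds. `scores` maps rev_id -> {"damaging": p, "goodfaith": p}."""
    return [
        rev_id
        for rev_id, score in scores.items()
        if score["damaging"] >= damaging_threshold
        and score["goodfaith"] >= goodfaith_threshold
    ]


# Invented example scores (probability of damaging.true / goodfaith.true).
scores = {
    1001: {"damaging": 0.92, "goodfaith": 0.15},
    1002: {"damaging": 0.71, "goodfaith": 0.88},
    1003: {"damaging": 0.05, "goodfaith": 0.99},
    1004: {"damaging": 0.64, "goodfaith": 0.93},
    1005: {"damaging": 0.30, "goodfaith": 0.97},
}

# Hypothetical strict and relaxed cut-offs, not the production thresholds.
strict = edits_in_intersection(scores, damaging_threshold=0.8, goodfaith_threshold=0.95)
relaxed = edits_in_intersection(scores, damaging_threshold=0.6, goodfaith_threshold=0.85)
print(len(strict), len(relaxed))
```

In this made-up sample, high damaging scores tend to come with low goodfaith scores, so the strict pair of cut-offs selects nothing while the relaxed pair selects two edits.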
Halfak added a comment. May 4 2017, 6:48 PM

My analysis: https://meta.wikimedia.org/wiki/Research_talk:Automated_classification_of_edit_quality/Work_log/2017-05-04

My recommendation: Don't set such strict thresholds. Models will still be useful at lower levels of confidence.

jmatazzoni added a comment. Edited May 4 2017, 8:45 PM

Thanks, @Halfak. Since we already have a lower-probability Quality filter ("May have problems"), I assume you are suggesting we lower the "Very likely good faith" filter threshold somewhat? (Just to point out: it's already possible to find intersections between "May have problems" and "Very likely good faith." The problem is that the lower probability of the "May" filter is not ideal, so it would be nice to be able to use "Likely have problems.")

In terms of the thresholds reflected in this spreadsheet, what settings would you recommend?

@jmatazzoni, I recommend the thresholds I used in the analysis: filter_rate_at_recall(min_recall=0.75) for damaging and recall_at_precision(min_precision=0.99) for goodfaith, as those gave me useful results.
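
For readers unfamiliar with the recall-based form, here is a minimal sketch of one way to read filter_rate_at_recall(min_recall=0.75): among thresholds that still catch at least min_recall of the positive cases, pick the one that filters out the largest share of edits. Both this reading and the definition of filter rate used here (the share of all edits that are not flagged) are assumptions, the function is a hypothetical re-implementation, and the sample data is invented.

```python
def filter_rate_at_recall(scored, min_recall):
    """Return (threshold, filter_rate, recall) with the highest filter rate
    among thresholds whose recall is at least min_recall, or None.
    `scored` is a list of (score, is_positive) pairs."""
    positives = sum(1 for _, label in scored if label)
    best = None
    for threshold in sorted({score for score, _ in scored}):
        flagged = [label for score, label in scored if score >= threshold]
        recall = sum(flagged) / positives
        # Assumed definition: share of all edits a patroller can skip.
        filter_rate = 1 - len(flagged) / len(scored)
        if recall >= min_recall and (best is None or filter_rate > best[1]):
            best = (threshold, filter_rate, recall)
    return best


# Hypothetical (score, is_damaging) pairs, invented for illustration only.
sample = [(0.95, True), (0.90, True), (0.80, False), (0.70, True),
          (0.60, False), (0.40, True), (0.30, False), (0.10, False)]

choice = filter_rate_at_recall(sample, min_recall=0.75)
print(choice)
```

On this toy sample the chosen threshold is 0.70: half of the edits drop out of the review queue while three of the four damaging edits remain in it.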


@Halfak, thanks for the recommendations. Let me take them in turn to make sure I understand:

Move goodfaith to .99.

Good idea. I see no downside. I'll put in a ticket to make this happen.

Move "damaging" to .75.

I didn't know we had a recall=.75 setting. I don't think this would work for our "Likely" filter, since it would provide precision of, what, something like 20%? That's not what most people mean by "likely," and it doesn't provide much differentiation from the "May" filter levels.

Why don't we try moving "Likely have problems" to precision=.45, together with the other change, and see how that goes? What do you think?

Halfak added a comment. May 6 2017, 3:21 PM

@jmatazzoni, I don't think that precision-focused thresholds match the use-case of damage patrolling -- whether it is focused on good-faith damage or vandalism. I believe that basing thresholds on recall is how patrollers would rather direct their work. After all, a prediction with high precision and low recall would not allow a patroller to catch "most of the damage". I think the naming scheme for the thresholds used in ERI is unfortunate in this regard and I'm not sure what direction I'd recommend from a product perspective. If you'll recall, I brought up this concern several times during the initial designs of the ERI filters.

In my qualitative analysis of edits that met this threshold, I found that 19 out of 30 were some version of damaging. Using a chi^2 statistic, we can say with 95% confidence that the precision at that threshold is between 43.9% and 79.5% for a sample of recent, non-bot edits (last 30 days at the time I ran the query).
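
The comment does not say exactly how the interval was computed. As a sketch under that caveat, a Wilson score interval with continuity correction (the kind of interval that chi-squared based proportion tests commonly report) gives the same figures for 19 out of 30. The function name is made up for this example, and the choice of method is an assumption.

```python
from math import sqrt
from statistics import NormalDist


def wilson_cc_interval(successes, n, confidence=0.95):
    """Wilson score interval with continuity correction for a proportion.
    Assumed method; the original analysis may have used something else."""
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    p = successes / n
    q = 1 - p
    denom = 2 * (n + z * z)
    lower = (2 * n * p + z * z - 1
             - z * sqrt(z * z - 2 - 1 / n + 4 * p * (n * q + 1))) / denom
    upper = (2 * n * p + z * z + 1
             + z * sqrt(z * z + 2 - 1 / n + 4 * p * (n * q - 1))) / denom
    # Clamp to the valid range for a proportion.
    return max(0.0, lower), min(1.0, upper)


low, high = wilson_cc_interval(19, 30)
print(f"{low:.1%} to {high:.1%}")  # 43.9% to 79.5%
```

The interval is wide because the sample is only 30 edits. The point estimate is 19/30, or about 63%.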

We are getting intersection now, so I'm closing this ticket. But we'll keep an eye on the issue of whether the results from "V. likely good faith" + "Likely problems" are useful to users.

jmatazzoni closed this task as Resolved. May 9 2017, 10:09 PM