
Implement balanced not-damaging/maybe-damaging edit extractor for `editquality`
Closed, ResolvedPublic


Some wikis are dominated by bots. These wikis tend to have few reverted edits to train on. We need a good strategy for gathering balanced sets of non-damaging and potentially damaging edits from Wikipedia.

editquality extract_balanced_sample <dump-file>... --host=<url> [--start=<date>] [--end=<date>] 

This script should process edits to pages that fall between start and end and keep track of sets of revision IDs that are either reverted or not reverted. In the end, it should output a TSV file with three columns: rev_id, potentially_damaging, and reason. The dataset should contain exactly the same number of potentially_damaging and not-potentially_damaging rows. If --reverted-only is set, only reverted edits will be considered potentially damaging (useful for training reverted models). Otherwise, all edits from non-trusted users will be flagged as well.

Event Timeline

Halfak created this task.Dec 9 2015, 8:21 PM
Halfak claimed this task.
Halfak raised the priority of this task from to Needs Triage.
Halfak updated the task description.
Halfak added a subscriber: Halfak.
Restricted Application added subscribers: StudiesWorld, Aklapper. Dec 9 2015, 8:21 PM

@Ladsgroup has done some work here. I'm not sure where that code lives. We could probably merge that into editquality.

Halfak reassigned this task from Halfak to Ladsgroup.Dec 23 2015, 4:18 AM
Halfak moved this task from Active to Backlog on the Scoring-platform-team (Current) board.
Halfak set Security to None.

@Ladsgroup submitted this PR:

I have some reservations, so I'll be taking a pass at it next.

Just took a pass at it again. I think that this dump extraction strategy doesn't make sense when we need to look up user info. I think we should simplify this script -- to just extract reverted-and-likely-damaging edits vs. not. We'll want to think carefully about how we'll build balanced sets for labeling in Wikilabels, but it seems that filtering out bot edits and focusing on human damage/not-damage is a good first step.
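Filtering out bot edits, as suggested above, could be done with a simple heuristic. A sketch, under the assumption that user group membership is available for each edit (the "bot" group is the standard MediaWiki group; the name-suffix check is just a common Wikimedia convention, not a guarantee):

```python
def is_probably_bot(user_name, user_groups=()):
    """Heuristic bot check for filtering edits.

    Flags a user as a likely bot if they belong to the 'bot' user group
    or their name ends in 'bot' (a common naming convention on
    Wikimedia wikis). Hypothetical helper, not part of editquality.
    """
    return "bot" in user_groups or user_name.lower().endswith("bot")
```

Edits failing this check would be kept as human edits for the damage/not-damage sample; anything more robust would need a real user-group lookup against the wiki's API.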

Halfak claimed this task.Jan 8 2016, 6:08 PM
Halfak moved this task from Backlog to Review on the Scoring-platform-team (Current) board.
Halfak reassigned this task from Halfak to Ladsgroup.Jan 15 2016, 5:58 PM
Halfak moved this task from Review to Done on the Scoring-platform-team (Current) board.
Halfak closed this task as Resolved.Jan 21 2016, 3:44 PM