
Implement balanced not-damaging/maybe-damaging edit extractor for `editquality`
Closed, Resolved · Public


Some wikis are dominated by bots. These wikis tend to have few reverted edits to train on. We need a good strategy for gathering balanced sets of non-damaging and potentially damaging edits from Wikipedia.

editquality extract_balanced_sample <dump-file>... --host=<url> [--start=<date>] [--end=<date>] [--reverted-only]

This script should process edits to pages made between start and end, keeping track of which revision IDs were reverted and which were not. In the end, it should output a TSV file with three columns: rev_id, potentially_damaging, and reason. The dataset should contain the exact same number of potentially_damaging and not potentially_damaging rows. If --reverted-only is set, only reverted edits will be considered potentially damaging (useful for training reverted models); otherwise, all edits from non-trusted users will be flagged as well.
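The balancing step above can be sketched roughly as follows. This is an illustrative outline only, not the actual editquality implementation: the helper names (`balanced_sample`, `write_tsv`) and the input shape (lists of `(rev_id, reason)` tuples, already split into potentially-damaging and not-potentially-damaging) are assumptions for the sake of the example.

```python
import csv
import random


def balanced_sample(damaging, not_damaging, seed=0):
    """Downsample the larger class so both classes contribute equal rows.

    Each input is a list of (rev_id, reason) tuples.  Returns shuffled
    rows of (rev_id, potentially_damaging, reason).
    """
    rng = random.Random(seed)
    n = min(len(damaging), len(not_damaging))
    rows = [(rid, True, reason)
            for rid, reason in rng.sample(damaging, n)]
    rows += [(rid, False, reason)
             for rid, reason in rng.sample(not_damaging, n)]
    rng.shuffle(rows)
    return rows


def write_tsv(rows, path):
    """Write the three-column TSV described in the task."""
    with open(path, "w", newline="") as f:
        writer = csv.writer(f, delimiter="\t")
        writer.writerow(["rev_id", "potentially_damaging", "reason"])
        writer.writerows(rows)
```

For example, with three potentially damaging revisions and two non-damaging ones, `balanced_sample` would emit two rows of each class.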

Event Timeline

Halfak claimed this task.
Halfak raised the priority of this task from to Needs Triage.
Halfak updated the task description.
Halfak added a subscriber: Halfak.

@Ladsgroup has done some work here. I'm not sure where that code lives. We could probably merge it into editquality.

Halfak moved this task from Active to Backlog on the Machine-Learning-Team (Active Tasks) board.
Halfak set Security to None.

@Ladsgroup submitted this PR:

I have some reservations, so I'll be taking a pass at it next.

Just took a pass at it again. I think this dump extraction strategy doesn't make sense when we need to look up user info. I think we should modify this script to be simpler -- to just get reverted-and-likely-damaging edits vs. not. We'll want to think carefully about how we'll build balanced sets for labeling in Wikilabels, but filtering out bot edits and focusing on human damage/not-damage seems like a good first step.