Maniphest T120999

Implement balanced not-damaging/maybe-damaging edit extractor for `editquality`
Closed, ResolvedPublic

Assigned To

Authored By

	Halfak
	Dec 9 2015, 8:21 PM

Description

Some wikis are dominated by bots. These wikis tend to have few reverted edits to train on. We need a good strategy for gathering balanced sets of non-damaging and potentially damaging edits from Wikipedia.

editquality extract_balanced_sample <dump-file>... --host=<url> [--start=<date>] [--end=<date>] 
                                                                [--trusted-groups=<groups>] 
                                                                [--trusted-edits=<num>] 
                                                                [--revert-radius=<revs>] 
                                                                [--revert-window=<hrs>]
                                                                [--reverted-only]
                                                                [--verbose] 
                                                                [--debug]

This script should process edits to pages that fall between start and end and keep track of two sets of revision IDs: those that were reverted and those that were not. In the end, it should output a TSV file that has three columns: rev_id, potentially_damaging and reason. The dataset should contain exactly the same number of potentially_damaging and not potentially_damaging rows. If --reverted-only is set, only reverted edits will be considered potentially damaging (useful for training reverted models). Otherwise, all edits from non-trusted users will be flagged as well.
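The labeling and balancing rules described above could be sketched roughly as follows. This is a minimal illustration, not the code that was merged into editquality: the function names `label`, `balance` and `write_tsv`, the reason strings, and the choice to downsample the larger class at random are all assumptions made for the example.

```python
import csv
import io
import random


def label(reverted, trusted, reverted_only=False):
    """Return (potentially_damaging, reason) for one edit.

    The reason strings are hypothetical; the task does not specify them.
    """
    if reverted:
        return True, "reverted"
    if not reverted_only and not trusted:
        # Without --reverted-only, edits by non-trusted users are flagged too.
        return True, "untrusted user"
    return False, ("not reverted" if reverted_only else "trusted user")


def balance(labeled, seed=0):
    """Downsample the larger class so both classes have the same size.

    `labeled` is a list of (rev_id, potentially_damaging, reason) tuples.
    Random downsampling is an assumption; other strategies are possible.
    """
    rng = random.Random(seed)
    damaging = [row for row in labeled if row[1]]
    not_damaging = [row for row in labeled if not row[1]]
    n = min(len(damaging), len(not_damaging))
    sample = rng.sample(damaging, n) + rng.sample(not_damaging, n)
    rng.shuffle(sample)
    return sample


def write_tsv(rows, out):
    """Write the three-column TSV described in the task."""
    writer = csv.writer(out, delimiter="\t", lineterminator="\n")
    writer.writerow(["rev_id", "potentially_damaging", "reason"])
    for rev_id, damaging, reason in rows:
        writer.writerow([rev_id, damaging, reason])


if __name__ == "__main__":
    # Toy input: every fifth edit was reverted, even rev_ids are "trusted".
    edits = [(i, *label(reverted=(i % 5 == 0), trusted=(i % 2 == 0)))
             for i in range(20)]
    buf = io.StringIO()
    write_tsv(balance(edits), buf)
    print(buf.getvalue(), end="")
```

Passing `reverted_only=True` to `label` corresponds to the --reverted-only behavior: only reverted edits end up in the potentially damaging class.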

Related Objects

		Status	Subtype	Assigned	Task
		Resolved		Halfak	T120173 Edit quality campaign for Urdu Wikipedia
		Resolved		Ladsgroup	T120999 Implement balanced not-damaging/maybe-damaging edit extractor for `editquality`

Event Timeline

Halfak created this task. Dec 9 2015, 8:21 PM

Halfak claimed this task.

Halfak raised the priority of this task from to Needs Triage.

Halfak updated the task description.

Halfak added a project: Machine-Learning-Team (Active Tasks).

Halfak subscribed.

Restricted Application added subscribers: StudiesWorld, Aklapper. Dec 9 2015, 8:21 PM

@Ladsgroup has done some work here. I'm not sure where that code lives. We could probably merge it into editquality.

@Ladsgroup submitted this PR: https://github.com/wiki-ai/editquality/pull/4

I have some reservations, so I'll be taking a pass at it next.

Halfak added a project: editquality-modeling. Dec 23 2015, 4:19 AM

Halfak added a parent task: T120173: Edit quality campaign for Urdu Wikipedia. Dec 23 2015, 4:27 AM

Just took a pass at it again. I think that this dump extraction strategy doesn't make sense when we need to look up user-info. I think we should simplify this script so that it just separates reverted-and-likely-damaging edits from the rest. We'll want to think carefully about how we'll make balanced scripts for labeling in Wikilabels, but it seems that filtering out bot edits and focusing on human-damage/not-damage is a good first step.

MuhammadShuaib subscribed. Jan 2 2016, 12:00 AM

Halfak claimed this task. Jan 8 2016, 6:08 PM

Halfak moved this task from Backlog to Review on the Machine-Learning-Team (Active Tasks) board.

Halfak reassigned this task from Halfak to Ladsgroup. Jan 15 2016, 5:58 PM

Halfak moved this task from Review to Completed on the Machine-Learning-Team (Active Tasks) board.

Halfak closed this task as Resolved. Jan 21 2016, 3:44 PM

Phabricator_maintenance added a project: User-Ladsgroup. Aug 12 2016, 8:09 PM
