
Train/test copyvio detection model for text
Open, Medium, Public

Description

Edits: tools.wmflabs.org/eranbot/edits_eranbot.pkl

Looks like we need to generate a list of edits that were saved while the bot was online.

See https://gist.github.com/eranroz/88f7dd4c568a764fb8150375942b2223 for a query to get these revisions
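For illustration only, the filtering step could look roughly like the sketch below. It is not the query from the gist; the field names (`rev_id`, `timestamp`), the `saved_while_online` helper, and the example online periods are all assumptions made up for this example.

```python
from datetime import datetime


def saved_while_online(edits, online_periods):
    """Keep edits whose timestamp falls inside any [start, end) online period."""
    return [
        edit
        for edit in edits
        if any(start <= edit["timestamp"] < end for start, end in online_periods)
    ]


# Hypothetical online periods and edits, for illustration only.
online_periods = [
    (datetime(2016, 5, 1), datetime(2016, 5, 10)),
    (datetime(2016, 6, 1), datetime(2016, 6, 15)),
]
edits = [
    {"rev_id": 1, "timestamp": datetime(2016, 5, 3, 12, 0)},
    {"rev_id": 2, "timestamp": datetime(2016, 5, 20, 8, 30)},
    {"rev_id": 3, "timestamp": datetime(2016, 6, 14, 23, 59)},
]

kept = saved_while_online(edits, online_periods)
kept_ids = [edit["rev_id"] for edit in kept]
```

With many edits and few online periods a linear scan per edit is probably fine; a sorted interval lookup would only matter if the number of periods grows large.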

Event Timeline

I added login capabilities to the feature extractor so that we can process suppressed text. https://github.com/wiki-ai/revscoring/pull/256

OK. I have a dataset with 10 million edits from the live time periods of the copyvio bot. The next step is to produce a labeled dataset and see whether we can train something with reasonable fitness. I think we'll need to downsample the "false" observations.
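A minimal sketch of what sampling the "false" observations might look like, assuming the labeled dataset is a list of `(rev_id, is_copyvio)` pairs. The `downsample_negatives` helper, the 10:1 ratio, and the synthetic data are hypothetical choices for illustration, not something decided in this task.

```python
import random


def downsample_negatives(observations, negatives_per_positive=10, seed=0):
    """Keep every positive and a random subset of the negatives.

    observations: list of (rev_id, is_copyvio) pairs.
    negatives_per_positive: assumed target ratio, chosen for illustration.
    """
    rng = random.Random(seed)  # fixed seed so the sample is reproducible
    positives = [obs for obs in observations if obs[1]]
    negatives = [obs for obs in observations if not obs[1]]
    # Never ask for more negatives than are available.
    k = min(len(negatives), negatives_per_positive * len(positives))
    combined = positives + rng.sample(negatives, k)
    rng.shuffle(combined)
    return combined


# Synthetic example: 1 in 100 edits is labeled as a copyright violation.
observations = [(rev_id, rev_id % 100 == 0) for rev_id in range(1, 1001)]
sampled = downsample_negatives(observations)
n_positive = sum(1 for _, label in sampled if label)
n_negative = len(sampled) - n_positive
```

Because downsampling changes the class balance, fitness statistics measured on the sampled set would likely need to be rescaled to the true rate of violations before they are interpreted.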

The new CopyPatrol tool is slowly building up a dataset of manually confirmed copyright violations. If this is useful for training, let me know.

Halfak triaged this task as Medium priority. Aug 4 2016, 2:32 PM
Halfak mentioned this in Unknown Object (Task). Sep 6 2016, 6:42 PM
Halfak mentioned this in Unknown Object (Task). Dec 14 2016, 10:55 PM
eranroz renamed this task from Train/test copyvio detection model to Train/test copyvio detection model for text. Aug 15 2019, 3:52 PM