Edits: tools.wmflabs.org/eranbot/edits_eranbot.pkl
Looks like we need to generate a list of edits that were saved while the bot was online.
See https://gist.github.com/eranroz/88f7dd4c568a764fb8150375942b2223 for a query to get these revisions
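For the filtering step, here is a minimal sketch of keeping only the revisions saved while the bot was online. It is not the query from the gist. The names `filter_online_edits`, `rev_id`, `timestamp` and the `(start, end)` window pairs are assumptions made for illustration, and the windows are assumed not to overlap.

```python
from bisect import bisect_right
from datetime import datetime


def filter_online_edits(edits, windows):
    """Keep edits whose timestamp falls inside any online window.

    edits:   iterable of dicts with a "timestamp" datetime (hypothetical layout).
    windows: list of (start, end) datetime pairs, assumed non-overlapping;
             both ends are treated as inclusive.
    """
    windows = sorted(windows)
    starts = [start for start, _ in windows]
    kept = []
    for edit in edits:
        ts = edit["timestamp"]
        # Index of the last window that starts at or before this timestamp.
        i = bisect_right(starts, ts) - 1
        if i >= 0 and ts <= windows[i][1]:
            kept.append(edit)
    return kept
```

Sorting the windows once and using a binary search per edit keeps this practical for millions of revisions.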
Status | Subtype | Assigned | Task
---|---|---|---
Open | | None | T131481 Train/test copyvio detection model for text
Open | Spike | None | T102344 [Spike] Explore how we could train models on suppressed and revdeleted content
Open | | None | T209960 Allow privileged users to label deleted revision in Wikilabels
I added login capabilities to the feature extractor so that we can process suppressed text. https://github.com/wiki-ai/revscoring/pull/256
OK. I have a dataset with 10 million edits from the time periods when the copyvio bot was live. The next step is to produce a labeled dataset and see whether we can train a model with reasonable fitness. I think we'll need to sample the "false" observations rather than use all of them.
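One way the sampling of "false" observations could look, as a rough sketch only: keep every positive and draw a random subset of the negatives. The `copyvio` label key, the `ratio` parameter, and the function name are hypothetical, not part of any existing tooling.

```python
import random


def downsample_negatives(observations, ratio=1.0, seed=0):
    """Keep all positives and a random sample of negatives.

    observations: list of dicts with a boolean "copyvio" label (hypothetical key).
    ratio:        target number of negatives per positive.
    seed:         fixed seed so the sample is reproducible.
    """
    rng = random.Random(seed)
    positives = [o for o in observations if o["copyvio"]]
    negatives = [o for o in observations if not o["copyvio"]]
    # Never ask for more negatives than are available.
    k = min(len(negatives), int(len(positives) * ratio))
    sample = positives + rng.sample(negatives, k)
    rng.shuffle(sample)
    return sample
```

Any model trained on the balanced sample would still need to be evaluated against the original class proportions, since the sample no longer reflects how rare the "true" observations are.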
The new CopyPatrol tool is slowly building up a dataset of manually confirmed copyright violations. If this is useful for training, let me know.