Unsophisticated bad actors dataset
Open, Medium, Public

Description

This task is done when we have an openly licensed dataset containing labeled data about unsophisticated bad actors and good contributors to Wikipedia.

In this case, "unsophisticated" is contrasted with "sophisticated" bad behavior, which is much harder to identify and label.

Event Timeline

Hey folks. I'm creating this task to help us coordinate. I'll start creating some sub-tasks that could be part of the process.

Halfak triaged this task as Medium priority. May 18 2020, 4:59 PM
Halfak moved this task from Unsorted to Epic on the Machine-Learning-Team board.

So, I've started trying to build the set of known sockpuppet groups based on userpage tagging and block summaries. So far I have 14,700 masters and a total of 174,000 accounts. If we cut that down to cases with 10 or more confirmed accounts, it's 131,000 accounts across 3,100 masters. Currently I'm formatting this as a JSON file with the following schema:

{
    master_username: {
        "user": master_username,
        "socks": {
            sock_username: sock_block_object,
            ...
        },
        "block": master_block_object,
    },
    ...
}

A block object looks like this:

{"id": 6897138, "user": "Alchemy World", "userid": 29247980, "by": "Randykitty", "timestamp": "2016-09-24T09:29:01Z", "expiry": "infinity", "reason": "{{checkuserblock-account}}: Abusing [[WP:SOCK|multiple accounts]]: Please see: [[Wikipedia:Sockpuppet investigations/Group periodic table]]", "rangestart": "0.0.0.0", "rangeend": "0.0.0.0", "nocreate": "", "autoblock": "", "allowusertalk": ""}
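For anyone who wants to reproduce the "10 or more confirmed accounts" cut, here is a rough sketch of how it could be done against the schema above. The function names and the file path argument are made up for illustration, and it assumes the threshold counts only the entries under "socks" (not the master account itself), which may not match how the numbers above were computed.

```python
import json


def load_groups(path):
    """Load the master -> group mapping from a JSON file (path is hypothetical)."""
    with open(path, encoding="utf-8") as f:
        return json.load(f)


def filter_by_size(groups, min_accounts=10):
    """Keep only masters with at least `min_accounts` entries under "socks".

    Assumption: the master account is not counted toward the threshold.
    """
    return {
        master: group
        for master, group in groups.items()
        if len(group.get("socks", {})) >= min_accounts
    }


def count_socks(groups):
    """Total number of sock accounts across all masters."""
    return sum(len(group.get("socks", {})) for group in groups.values())


# Tiny made-up example in the same shape as the schema (block objects left empty).
example = {
    "MasterA": {
        "user": "MasterA",
        "socks": {f"SockA{i}": {} for i in range(12)},
        "block": {},
    },
    "MasterB": {"user": "MasterB", "socks": {"SockB0": {}}, "block": {}},
}

large = filter_by_size(example)
print(len(large), "masters,", count_socks(large), "sock accounts")
```

On the real file, `load_groups` would supply the `groups` argument in place of the made-up `example` dict.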

But I can also enrich this with more information to aid the categorization. I'm thinking that the master's block log may have useful clues, and we may even be able to automatically categorize some masters by searching for "vandalism", "spam", "harassment", etc. in their block log. For cases that go through the manual categorization process, I'd suggest:

  • Link to the SPI, if there is one
  • Link to the LTA page, if there is one
  • Master's block log
  • List of master's N most recent contributions
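The automatic keyword pass mentioned above could look something like the following sketch. It assumes the master's block log is available as a list of block objects with a "reason" field like the one shown earlier; the keyword list, function name, and sample log are illustrative only. Matching is a plain case-insensitive substring check, so it would need tuning before being trusted for labeling.

```python
# Illustrative keyword list taken from the examples in the comment above.
KEYWORDS = ("vandalism", "spam", "harassment")


def categorize_block_log(block_log, keywords=KEYWORDS):
    """Return the sorted list of keywords found in any block reason.

    `block_log` is assumed to be a list of block objects (dicts) that each
    carry a "reason" string. An empty result means no keyword matched, so
    the master would fall through to manual categorization.
    """
    found = set()
    for block in block_log:
        reason = block.get("reason", "").lower()
        for keyword in keywords:
            if keyword in reason:
                found.add(keyword)
    return sorted(found)


# Made-up block log for illustration; the second reason is the one from
# the sample block object above.
sample_log = [
    {"reason": "Vandalism-only account"},
    {
        "reason": "{{checkuserblock-account}}: Abusing [[WP:SOCK|multiple accounts]]: "
        "Please see: [[Wikipedia:Sockpuppet investigations/Group periodic table]]"
    },
]

print(categorize_block_log(sample_log))
```

A master whose log yields no keywords, like one blocked only with the checkuser reason above, would go to the manual process.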

By the way, this JSON file is about 72MB so far. Does anyone want to see it, and if so, any ideas where I can upload it?

This file structure makes sense to me. I think it would be good to get this into a git repository so that we can maintain a version history. Even at 72MB, we have some good strategies for storing large files. Could you upload a sample of the dataset to phab? Maybe the first 10-20 master-users.
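One way to cut such a sample, as a rough sketch: take the first N masters in file order and write them back out. The function name is made up, and it assumes the whole file has already been loaded into a dict with `json.load`, which keeps keys in file order.

```python
import json
from itertools import islice


def sample_masters(groups, n=10):
    """Return a dict holding only the first `n` masters, in their original order."""
    return dict(islice(groups.items(), n))


# Made-up stand-in for the full dataset, in the same shape as the schema.
groups = {
    f"Master{i}": {"user": f"Master{i}", "socks": {}, "block": {}}
    for i in range(25)
}

sample = sample_masters(groups, n=10)

# Serialize the sample; on the real data this string would be written to a file.
sample_json = json.dumps(sample, indent=2, ensure_ascii=False)
print(len(sample), "masters in sample")
```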