Unsophisticated bad actors dataset
Open, Medium, Public

Assigned To

None

Authored By

	Halfak
	May 15 2020, 5:08 PM

Description

This task is done when we have an openly licensed dataset containing labeled data about unsophisticated bad actors and good contributors to Wikipedia.

In this case "unsophisticated" is contrasted to "sophisticated" bad behavior that is much harder to identify and label.

Related Objects

Status	Assigned	Task
Open	None	T252894 Unsophisticated bad actors dataset
Open	None	T252895 Create wikilabels campaign for unsophisticated bad actors dataset.
Open	None	T252896 Decide on unsophisticated bad actors labels.

Event Timeline

Halfak created this task. May 15 2020, 5:08 PM

Restricted Application added a subscriber: Aklapper. May 15 2020, 5:08 PM

Hey folks. I'm creating this task to help us coordinate. I'll start creating some sub-tasks that could be part of the process.

Halfak triaged this task as Medium priority. May 18 2020, 4:59 PM

Halfak moved this task from Unsorted to Epic on the Machine-Learning-Team board.

So, I've started trying to build the set of known sockpuppet groups based on userpage tagging and block summaries. So far I have 14,700 masters and a total of 174,000 accounts. If we cut that down to cases with 10 or more confirmed accounts, it's 131,000 accounts across 3,100 masters. Currently I'm formatting this as a JSON file with the following schema:

{
    master_username: {
        "user": master_username,
        "socks": {
            sock_username: sock_block_object,
            ...
        },
        "block": master_block_object,
    },
    ...
}

A block object looks like this:

{"id": 6897138, "user": "Alchemy World", "userid": 29247980, "by": "Randykitty", "timestamp": "2016-09-24T09:29:01Z", "expiry": "infinity", "reason": "{{checkuserblock-account}}: Abusing [[WP:SOCK|multiple accounts]]: Please see: [[Wikipedia:Sockpuppet investigations/Group periodic table]]", "rangestart": "0.0.0.0", "rangeend": "0.0.0.0", "nocreate": "", "autoblock": "", "allowusertalk": ""}
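For anyone who wants to work with the file, here is a minimal Python sketch of the "10 or more confirmed accounts" cut against this schema. The function names `filter_masters` and `count_socks` and the toy data are made up for illustration, and it assumes the threshold counts sock accounts only (not the master), which the thread does not spell out.

```python
def filter_masters(groups, min_socks=10):
    """Keep only the masters that have at least `min_socks` sock accounts.

    `groups` is a dict shaped like the schema above:
    master_username -> {"user": ..., "socks": {...}, "block": ...}
    Assumption: the threshold counts socks only, not the master account.
    """
    return {
        master: entry
        for master, entry in groups.items()
        if len(entry.get("socks", {})) >= min_socks
    }


def count_socks(groups):
    """Total number of sock accounts across all masters in `groups`."""
    return sum(len(entry.get("socks", {})) for entry in groups.values())


# Toy data only; these usernames and block objects are not from the real dataset.
toy = {
    "ExampleMasterA": {
        "user": "ExampleMasterA",
        "socks": {f"ExampleSockA{i}": {"reason": "toy block"} for i in range(12)},
        "block": {"reason": "toy block"},
    },
    "ExampleMasterB": {
        "user": "ExampleMasterB",
        "socks": {f"ExampleSockB{i}": {"reason": "toy block"} for i in range(3)},
        "block": {"reason": "toy block"},
    },
}

kept = filter_masters(toy)
print(len(kept), "masters kept,", count_socks(kept), "socks")
```

With the real file, `groups` would come from `json.load` on wherever the file ends up being stored.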

But I can also enrich this with more information to aid the categorization. I'm thinking that the master's block log may have useful clues, and we may even be able to automatically categorize some masters by searching for "vandalism", "spam", "harassment", etc. in their block log. For cases that go through the manual categorization process, I'd suggest:

Link to the SPI, if there is one
Link to the LTA page, if there is one
Master's block log
List of master's N most recent contributions
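The automatic pass mentioned above could look roughly like the Python sketch below. It assumes a master's block log is available as a list of block objects with a "reason" field, like the example block object; `KEYWORDS` holds only the three terms named here, and `categorize_block_log` is a hypothetical helper, not existing tooling.

```python
# The three terms named in this comment; any further terms would be an addition.
KEYWORDS = ("vandalism", "spam", "harassment")


def categorize_block_log(block_log, keywords=KEYWORDS):
    """Return the keywords found in any block reason, in first-seen order.

    `block_log` is assumed to be a list of block objects (dicts) that each
    carry a "reason" string. Matching is a case-insensitive substring search,
    so it is a rough first pass rather than a reliable classifier.
    """
    found = []
    for block in block_log:
        reason = block.get("reason", "").lower()
        for keyword in keywords:
            if keyword in reason and keyword not in found:
                found.append(keyword)
    return found


# Toy block log; the reasons are invented for illustration.
toy_log = [
    {"reason": "Vandalism-only account"},
    {"reason": "Spam / advertising-only account"},
]
print(categorize_block_log(toy_log))
```

A master whose log matches none of the keywords returns an empty list and would fall through to the manual categorization process. The example block object above is one such case, since its reason only mentions abusing multiple accounts.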

By the way, this JSON file is about 72MB so far. Does anyone want to see it, and if so, any ideas where I can upload it?

This file structure makes sense to me. I think it would make sense to get this into a git repository so that we can maintain a version history. Even at 72MB, we have some good strategies for storing large files. Could you upload a sample of the dataset to phab? Maybe the first 10-20 master-users.
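A sample like that could be cut with a few lines of Python. This is only a sketch: `sample_first_masters` is a hypothetical helper, and it relies on Python dicts (including those produced by `json.load`) keeping the key order of the file, so "first" means first in file order.

```python
import json
from itertools import islice


def sample_first_masters(groups, n=10):
    """Return a new dict holding only the first `n` masters of `groups`.

    Relies on dicts preserving insertion order (Python 3.7+), so the
    sample follows the order in which masters appear in the file.
    """
    return dict(islice(groups.items(), n))


# Toy stand-in for the real file; these usernames are invented.
toy_groups = {
    f"ExampleMaster{i}": {"user": f"ExampleMaster{i}", "socks": {}, "block": {}}
    for i in range(25)
}

sample = sample_first_masters(toy_groups, n=10)
# Serialize the sample so it can be attached or pasted somewhere.
sample_json = json.dumps(sample, indent=4)
print(len(sample), "masters in sample")
```

Taking the first entries keeps the sample reproducible. A random sample would be more representative, but that goes beyond what was asked for here.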

Niharika subscribed. May 26 2020, 11:45 PM

ACraze moved this task from Epic to Backlog/Other on the Machine-Learning-Team board. Jan 19 2021, 8:38 PM

Samwalton9-WMF subscribed. Aug 8 2024, 7:40 AM
