
Analyze data to identify patterns and features of spambots
Closed, Resolved · Public

Description

Event Timeline

Weekly updates:

  • Data was analyzed to examine the distribution of time it takes for spambots to get globally locked (see the sketch below)
  • Attended the first session of Disinformation Design
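
A minimal sketch of how such a time-to-lock distribution could be computed, assuming a pandas DataFrame of locked accounts with hypothetical first_edit_ts and locked_ts timestamp columns (the actual data source and schema are not stated in the update):

```
import pandas as pd

# Hypothetical input: one row per account globally locked as a spambot,
# with timestamps for its first edit and for the global lock.
spambots = pd.read_parquet("spambot_accounts.parquet")  # illustrative path

# Hours between the account's first edit and the moment it was locked.
hours_to_lock = (
    spambots["locked_ts"] - spambots["first_edit_ts"]
).dt.total_seconds() / 3600

print(hours_to_lock.describe(percentiles=[0.25, 0.5, 0.75, 0.9]))
```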

Weekly updates:

  • Attended the third session of Disinformation Design

Weekly updates:

  • URLs have been extracted from the existing datasets:
    • visible revisions by editors globally locked as spambots
    • deleted revisions by editors globally locked as spambots
    • deleted revisions that hit Abuse Filter rules about spam or link addition
  • A new dataset has been created with these URLs (see the extraction sketch below), including:
    • general metadata: text, domain, wiki_db, revision_id, source (spambot_diff, spambot_deleted, abuse_filter_deleted)
    • revision metadata from MediaWiki_history
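
A condensed sketch of this extraction step in PySpark (which the project already uses), assuming the three source DataFrames expose wiki_db, revision_id and a raw wikitext column named text; all table names and column names here are illustrative, not the actual schema:

```
import re
from urllib.parse import urlparse
from pyspark.sql import SparkSession, functions as F, types as T

spark = SparkSession.builder.getOrCreate()

# Hypothetical inputs; the actual table names are not stated in the update.
spambot_diff = spark.read.table("spambot_visible_revisions")
spambot_deleted = spark.read.table("spambot_deleted_revisions")
abuse_filter_deleted = spark.read.table("abuse_filter_deleted_revisions")

URL_RE = re.compile(r"https?://[^\s|\]<>\"']+")

@F.udf(T.ArrayType(T.StringType()))
def extract_urls(text):
    # Pull every URL out of the raw wikitext of a revision.
    return URL_RE.findall(text or "")

@F.udf(T.StringType())
def domain_of(url):
    return urlparse(url).netloc

def urls_from(df, source):
    # One output row per URL occurrence, tagged with its source dataset.
    return (
        df.withColumn("url", F.explode(extract_urls("text")))
          .withColumn("domain", domain_of("url"))
          .select("url", "domain", "wiki_db", "revision_id",
                  F.lit(source).alias("source"))
    )

urls = (
    urls_from(spambot_diff, "spambot_diff")
    .unionByName(urls_from(spambot_deleted, "spambot_deleted"))
    .unionByName(urls_from(abuse_filter_deleted, "abuse_filter_deleted"))
)
# Revision metadata from MediaWiki_history can then be joined on
# (wiki_db, revision_id).
```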

Weekly updates:

  • A new way has been found to retrieve the text of edits that hit an Abuse Filter rule with "disallow"/"block" actions (the existing dataset only contained revisions that hit rules with "tag"/"warn" actions). T&S has been contacted for research access to this specific data.
  • A call with researchers from Georgia Tech was organized to share progress on work related to sockpuppet, ban evasion, and spambot detection.

Weekly updates:

  • Launched a Spark process to retrieve all URLs in revisions of each wiki, grouping revisions by page to keep the first occurrence of each URL (see the sketch below)
  • Profiling of the spambot URLs dataset did not surface any relevant warnings
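
The first-occurrence-per-page logic can be sketched with a window function, assuming a DataFrame all_urls of per-revision URLs with hypothetical wiki_db, page_id, url and revision_timestamp columns:

```
from pyspark.sql import functions as F, Window

# Keep only the first time each URL appears on a given page.
w = (
    Window.partitionBy("wiki_db", "page_id", "url")
          .orderBy("revision_timestamp")
)

first_occurrences = (
    all_urls
    .withColumn("rn", F.row_number().over(w))
    .filter(F.col("rn") == 1)
    .drop("rn")
)
```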

Weekly updates:

  • Started working on matching spambot-related URLs against URLs from the entire dump to measure the presence of spam across wikis (see the sketch below)
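
A rough sketch of one way this matching could work, assuming both datasets carry a normalized domain column and that matching happens at the domain level (both are assumptions, not stated in the update):

```
from pyspark.sql import functions as F

# Domains seen in the spambot URL dataset (spambot_urls, dump_urls are
# hypothetical DataFrames from the earlier extraction steps).
spam_domains = spambot_urls.select("domain").distinct()

# Dump URLs whose domain also appears in the spambot dataset.
matched = dump_urls.join(spam_domains, on="domain", how="left_semi")

# Share of matched URLs per wiki, as one possible measure of spam presence.
per_wiki = matched.groupBy("wiki_db").count().withColumnRenamed("count", "matched_urls")
totals = dump_urls.groupBy("wiki_db").count().withColumnRenamed("count", "total_urls")
presence = (
    per_wiki.join(totals, on="wiki_db")
    .withColumn("spam_share", F.col("matched_urls") / F.col("total_urls"))
)
```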

Weekly updates:

  • The matching between spambot-related URLs and URLs from the entire dump revealed new challenges:
    • Some editors globally locked as spambots are not really spambots but correspond to another type of malicious behavior (e.g., sockpuppets).
    • Many URLs added by spambots were not spam related. Different filters are being explored to improve the quality of the URLs dataset:
      • Keeping only URLs with a well-formatted domain (regex)
      • Keeping only URLs whose domain is not in a blacklist of non-spam domains
      • Keeping only URLs with no other URL from the same domain
      • Keeping only URLs shared by multiple spambots or shared in multiple wikis

Weekly updates:

  • After intense data exploration and cleaning, the spambot URLs dataset was filtered to keep only URLs (the full filter chain is sketched after this list):
    • of a domain matching .*[a-z0-9].[a-z0-9].*
    • of a domain not ending in .org, .gov, or .edu
    • of a domain not included in a curated blacklist
    • from wiki revisions (several URLs from the wikivoyage and wikinews projects are not spam related)
    • from spambot revisions (deleted revisions that hit Abuse Filter rules about spam or link addition were often false positives)
    • from spambots with fewer than 5 edits (many users were globally locked as spambots because of other activities such as vandalism or sockpuppetry, so their URLs were not spam related)
  • The resulting dataset contains 5718 rows of URLs from spambot deleted revisions, 1928 rows of URLs from spambot visible revisions, and 155 rows of URLs from revisions of the dump.
  • Information has been identified about editors who are not globally locked but share spambot URLs, and about existing pages containing spambot URLs. This and other general findings will be shared with stewards.
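
The filter chain can be summarized in a PySpark sketch, assuming the URL DataFrame from the earlier extraction also carries a precomputed user_edit_count column and that the curated blacklist is available as a small DataFrame (both assumptions):

```
from pyspark.sql import functions as F

# Illustrative stand-in for the curated blacklist of non-spam domains.
blacklist = spark.createDataFrame([("example.com",)], ["domain"])

filtered = (
    urls
    # Well-formatted domain, using the regex as stated in the update.
    .filter(F.col("domain").rlike(".*[a-z0-9].[a-z0-9].*"))
    # Drop .org / .gov / .edu domains.
    .filter(~F.col("domain").rlike(r"\.(org|gov|edu)$"))
    # Drop domains in the curated blacklist.
    .join(blacklist, on="domain", how="left_anti")
    # Keep wiki revisions only (wikivoyage/wikinews URLs were often not spam).
    .filter(~F.col("wiki_db").rlike("wikivoyage|wikinews"))
    # Keep spambot revisions only (Abuse Filter hits were often false positives).
    .filter(F.col("source").isin("spambot_diff", "spambot_deleted"))
    # Keep URLs added by spambots with fewer than 5 edits.
    .filter(F.col("user_edit_count") < 5)
)
```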

Weekly updates:

  • Updated project progress on Meta and Betterworks
  • Created a dataset of URLs from the historical dump to compare with the spambot dataset (a preliminary list of features has been defined)

Weekly updates:

  • The first comparison between URLs in deleted revisions by editors globally locked as spambots and a random sample of URLs revealed different patterns regarding username, revision timing (e.g., the time between a revision and the account's creation, the account's previous edit, or the previous edit on that page; see the sketch below), the namespace of the targeted page, etc.
  • The above results might be affected by the filtering criteria of the spambot dataset, in particular the maximum number of edits per spambot (threshold = 5). This is evident in that the most common tag for spambot revisions is "newbie external link", while the most common tag for the sample dataset is "visual editor". Therefore, next steps will focus on a more controlled comparison that applies the same filters to the sample dataset.
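
The timing features mentioned above can be sketched as differences between event timestamps, assuming a mediawiki_history-style DataFrame revisions with per-revision timestamps; the column names below mirror the described features but are illustrative, not the actual schema:

```
from pyspark.sql import functions as F

def hours_between(later, earlier):
    # Elapsed hours between two timestamp columns.
    return (F.unix_timestamp(later) - F.unix_timestamp(earlier)) / 3600

# `revisions`: hypothetical DataFrame derived from MediaWiki_history.
timing_features = revisions.select(
    "revision_id",
    hours_between("event_timestamp", "page_previous_edit_timestamp")
        .alias("hours_since_page_previous_edit"),
    hours_between("event_timestamp", "editor_previous_edit_timestamp")
        .alias("hours_since_editor_previous_edit"),
    hours_between("event_timestamp", "editor_account_creation_timestamp")
        .alias("hours_since_account_creation"),
)
```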

Weekly updates:

  • Tested several standard machine learning models (logistic regression, random forest, SVM, and kNN) to classify URLs into the spambot and sample classes using the above datasets (see the sketch below).
  • Results, including a feature importance analysis, indicate the high predictive value of features identified during data exploration, e.g. revision_page_namespace_is_content, revision_hours_since_page_previous_edit, user_hours_since_editor_first_edit, user_hours_since_editor_previous_edit, user_edit_count, and user_hours_since_account_creation.
  • A call with experts will be scheduled to discuss these results and next steps.
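
A minimal sketch of such a model comparison with scikit-learn, assuming a feature matrix X, binary labels y (spambot vs. sample) and a feature_names list built from the datasets above; the model set follows the update, while the hyperparameters are illustrative:

```
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import cross_val_score

models = {
    "LR": LogisticRegression(max_iter=1000),
    "RF": RandomForestClassifier(n_estimators=200, random_state=0),
    "SVM": SVC(),
    "KNN": KNeighborsClassifier(),
}

# Cross-validated F1 score for each model.
for name, model in models.items():
    scores = cross_val_score(model, X, y, cv=5, scoring="f1")
    print(f"{name}: F1 = {scores.mean():.3f} ± {scores.std():.3f}")

# Feature importances from the random forest, matched back to column names.
rf = models["RF"].fit(X, y)
for col, imp in sorted(zip(feature_names, rf.feature_importances_),
                       key=lambda t: -t[1]):
    print(col, round(imp, 3))
```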