
Analyze data to identify patterns and features of spambots
Closed, Resolved · Public

Description

Event Timeline

Weekly updates:

  • Data was analyzed to examine the distribution of time it takes for spambots to get globally locked (see the sketch below)
  • Attended the first session of Disinformation Design
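
A minimal sketch of how such a time-to-lock distribution could be computed, assuming a pandas DataFrame of locked accounts with hypothetical first_edit_ts and locked_ts timestamp columns (the actual data source and schema are not stated in the update):

```
import pandas as pd

# Hypothetical input: one row per account globally locked as a spambot,
# with timestamps for its first edit and for the global lock.
spambots = pd.read_parquet("spambot_accounts.parquet")  # illustrative path

# Hours between the account's first edit and the moment it was locked.
hours_to_lock = (
    spambots["locked_ts"] - spambots["first_edit_ts"]
).dt.total_seconds() / 3600

print(hours_to_lock.describe(percentiles=[0.25, 0.5, 0.75, 0.9]))
```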

Weekly updates:

  • Attended the third session of Disinformation Design

Weekly updates:

  • URLs have been extracted from the existing datasets:
    • visible revisions by editors globally locked as spambots
    • deleted revisions by editors globally locked as spambots
    • deleted revisions that hit Abuse Filter rules about spam or link addition
  • A new dataset has been created with these URLs (see the extraction sketch below), including:
    • general metadata: text, domain, wiki_db, revision_id, source (spambot_diff, spambot_deleted, abuse_filter_deleted)
    • revision metadata from MediaWiki_history
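
A condensed sketch of this extraction step in PySpark (which the project already uses), assuming the three source DataFrames expose wiki_db, revision_id and a raw wikitext column named text; all table names and column names here are illustrative, not the actual schema:

```
import re
from urllib.parse import urlparse
from pyspark.sql import SparkSession, functions as F, types as T

spark = SparkSession.builder.getOrCreate()

# Hypothetical inputs; the actual table names are not stated in the update.
spambot_diff = spark.read.table("spambot_visible_revisions")
spambot_deleted = spark.read.table("spambot_deleted_revisions")
abuse_filter_deleted = spark.read.table("abuse_filter_deleted_revisions")

URL_RE = re.compile(r"https?://[^\s|\]<>\"']+")

@F.udf(T.ArrayType(T.StringType()))
def extract_urls(text):
    # Pull every URL out of the raw wikitext of a revision.
    return URL_RE.findall(text or "")

@F.udf(T.StringType())
def domain_of(url):
    return urlparse(url).netloc

def urls_from(df, source):
    # One output row per URL occurrence, tagged with its source dataset.
    return (
        df.withColumn("url", F.explode(extract_urls("text")))
          .withColumn("domain", domain_of("url"))
          .select("url", "domain", "wiki_db", "revision_id",
                  F.lit(source).alias("source"))
    )

urls = (
    urls_from(spambot_diff, "spambot_diff")
    .unionByName(urls_from(spambot_deleted, "spambot_deleted"))
    .unionByName(urls_from(abuse_filter_deleted, "abuse_filter_deleted"))
)
# Revision metadata from MediaWiki_history can then be joined on
# (wiki_db, revision_id).
```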

Weekly updates:

  • A new way has been found to retrieve the text of edits that hit an Abuse Filter rule with "disallow"/"block" actions (the existing dataset only contained revisions that hit rules with "tag"/"warn" actions). T&S has been contacted for research access to this specific data.
  • A call with researchers from Georgia Tech was organized to share progress on work related to sockpuppet, ban evasion, and spambot detection.

Weekly updates:

  • Launched a Spark process to retrieve all URLs in revisions of each wiki, grouping revisions by page to keep the first occurrence of each URL (see the sketch below)
  • Profiling of the spambot URLs dataset did not surface any relevant warnings
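
The first-occurrence-per-page logic can be sketched with a window function, assuming a DataFrame all_urls of per-revision URLs with hypothetical wiki_db, page_id, url and revision_timestamp columns:

```
from pyspark.sql import functions as F, Window

# Keep only the first time each URL appears on a given page.
w = (
    Window.partitionBy("wiki_db", "page_id", "url")
          .orderBy("revision_timestamp")
)

first_occurrences = (
    all_urls
    .withColumn("rn", F.row_number().over(w))
    .filter(F.col("rn") == 1)
    .drop("rn")
)
```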

Weekly updates:

  • Started working on matching spambot-related URLs against URLs from the entire dump to measure the presence of spam across wikis (see the sketch below)
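
A rough sketch of one way this matching could work, assuming both datasets carry a normalized domain column and that matching happens at the domain level (both are assumptions, not stated in the update):

```
from pyspark.sql import functions as F

# Domains seen in the spambot URL dataset (spambot_urls, dump_urls are
# hypothetical DataFrames from the earlier extraction steps).
spam_domains = spambot_urls.select("domain").distinct()

# Dump URLs whose domain also appears in the spambot dataset.
matched = dump_urls.join(spam_domains, on="domain", how="left_semi")

# Share of matched URLs per wiki, as one possible measure of spam presence.
per_wiki = matched.groupBy("wiki_db").count().withColumnRenamed("count", "matched_urls")
totals = dump_urls.groupBy("wiki_db").count().withColumnRenamed("count", "total_urls")
presence = (
    per_wiki.join(totals, on="wiki_db")
    .withColumn("spam_share", F.col("matched_urls") / F.col("total_urls"))
)
```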

Weekly updates:

  • The matching between spambot-related URLs and URLs from the entire dump revealed new challenges:
    • Some editors globally locked as spambots are not really spambots but correspond to another type of malicious behavior (e.g., sockpuppets).
    • Many URLs added by spambots were not spam related. Different filters are being explored to improve the quality of the URLs dataset:
      • Keeping only URLs with a well-formatted domain (regex)
      • Keeping only URLs whose domain is not in a blacklist of non-spam domains
      • Keeping only URLs with no other URL from the same domain
      • Keeping only URLs shared by multiple spambots or shared in multiple wikis

Weekly updates:

  • After intense data exploration and cleaning, the spambot URLs dataset was filtered to keep only URLs (the full filter chain is sketched after this list):
    • of a domain matching .*[a-z0-9].[a-z0-9].*
    • of a domain not ending in .org, .gov, or .edu
    • of a domain not included in a curated blacklist
    • from wiki revisions (several URLs from the wikivoyage and wikinews projects are not spam related)
    • from spambot revisions (deleted revisions that hit Abuse Filter rules about spam or link addition were often false positives)
    • from spambots with fewer than 5 edits (many users were globally locked as spambots because of other activities such as vandalism or sockpuppetry, so their URLs were not spam related)
  • The resulting dataset contains 5718 rows of URLs from spambot deleted revisions, 1928 rows of URLs from spambot visible revisions, and 155 rows of URLs from revisions of the dump.
  • Information has been identified about editors who are not globally locked but share spambot URLs, and about existing pages containing spambot URLs. This and other general findings will be shared with stewards.
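
The filter chain can be summarized in a PySpark sketch, assuming the URL DataFrame from the earlier extraction also carries a precomputed user_edit_count column and that the curated blacklist is available as a small DataFrame (both assumptions):

```
from pyspark.sql import functions as F

# Illustrative stand-in for the curated blacklist of non-spam domains.
blacklist = spark.createDataFrame([("example.com",)], ["domain"])

filtered = (
    urls
    # Well-formatted domain, using the regex as stated in the update.
    .filter(F.col("domain").rlike(".*[a-z0-9].[a-z0-9].*"))
    # Drop .org / .gov / .edu domains.
    .filter(~F.col("domain").rlike(r"\.(org|gov|edu)$"))
    # Drop domains in the curated blacklist.
    .join(blacklist, on="domain", how="left_anti")
    # Keep wiki revisions only (wikivoyage/wikinews URLs were often not spam).
    .filter(~F.col("wiki_db").rlike("wikivoyage|wikinews"))
    # Keep spambot revisions only (Abuse Filter hits were often false positives).
    .filter(F.col("source").isin("spambot_diff", "spambot_deleted"))
    # Keep URLs added by spambots with fewer than 5 edits.
    .filter(F.col("user_edit_count") < 5)
)
```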

Weekly updates:

  • Updated project progress on Meta and Betterworks
  • Created a dataset of URLs from the historical dump to compare with the spambot dataset (a preliminary list of features has been defined)

Weekly updates:

  • The first comparison between URLs in deleted revisions by editors globally locked as spambots and a random sample of URLs revealed different patterns regarding username, revision timing (e.g., the time between a revision and the account's creation, the account's previous edit, or the previous edit on that page; see the sketch below), the namespace of the targeted page, etc.
  • The above results might be affected by the filtering criteria of the spambot dataset, in particular the maximum number of edits per spambot (threshold = 5). This is evident in that the most common tag for spambot revisions is "newbie external link", while the most common tag for the sample dataset is "visual editor". Therefore, next steps will focus on a more controlled comparison that applies the same filters to the sample dataset.
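
The timing features mentioned above can be sketched as differences between event timestamps, assuming a mediawiki_history-style DataFrame revisions with per-revision timestamps; the column names below mirror the described features but are illustrative, not the actual schema:

```
from pyspark.sql import functions as F

def hours_between(later, earlier):
    # Elapsed hours between two timestamp columns.
    return (F.unix_timestamp(later) - F.unix_timestamp(earlier)) / 3600

# `revisions`: hypothetical DataFrame derived from MediaWiki_history.
timing_features = revisions.select(
    "revision_id",
    hours_between("event_timestamp", "page_previous_edit_timestamp")
        .alias("hours_since_page_previous_edit"),
    hours_between("event_timestamp", "editor_previous_edit_timestamp")
        .alias("hours_since_editor_previous_edit"),
    hours_between("event_timestamp", "editor_account_creation_timestamp")
        .alias("hours_since_account_creation"),
)
```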

Weekly updates:

  • Tested several standard machine learning models (logistic regression, random forest, SVM, and kNN) to classify URLs into the spambot and sample classes using the above datasets (see the sketch below).
  • Results, including a feature importance analysis, indicate the high predictive value of features identified during data exploration, e.g. revision_page_namespace_is_content, revision_hours_since_page_previous_edit, user_hours_since_editor_first_edit, user_hours_since_editor_previous_edit, user_edit_count, and user_hours_since_account_creation.
  • A call with experts will be scheduled to discuss these results and next steps.
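
A minimal sketch of such a model comparison with scikit-learn, assuming a feature matrix X, binary labels y (spambot vs. sample) and a feature_names list built from the datasets above; the model set follows the update, while the hyperparameters are illustrative:

```
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import cross_val_score

models = {
    "LR": LogisticRegression(max_iter=1000),
    "RF": RandomForestClassifier(n_estimators=200, random_state=0),
    "SVM": SVC(),
    "KNN": KNeighborsClassifier(),
}

# Cross-validated F1 score for each model.
for name, model in models.items():
    scores = cross_val_score(model, X, y, cv=5, scoring="f1")
    print(f"{name}: F1 = {scores.mean():.3f} ± {scores.std():.3f}")

# Feature importances from the random forest, matched back to column names.
rf = models["RF"].fit(X, y)
for col, imp in sorted(zip(feature_names, rf.feature_importances_),
                       key=lambda t: -t[1]):
    print(col, round(imp, 3))
```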