
Formalize the problem space and create a dataset of spambots activities
Closed, ResolvedPublic

Description

Event Timeline

Pablo removed a project: Epic.

Weekly updates:

  • The initial list of spambots has been extended with stats around the edit count per existing wiki.
  • A preliminary data analysis revealed that only a small fraction of the editors globally locked as spambots had generated cross-wiki activity.

Weekly updates:

  • Prepare a script to retrieve the revisions created by spambots from Mediawiki_wikitext_history.
  • Prepare a script to compute the diff between each revision and its parent revision.
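
The diff step can be sketched as follows. This is a minimal illustration rather than the project's actual script: the function name `added_and_removed_lines` is hypothetical, and it assumes the wikitext of a revision and of its parent are already available as plain strings (an empty string when the revision has no parent).

```python
import difflib


def added_and_removed_lines(parent_text, revision_text):
    """Return (added, removed) lines between a parent revision and its child.

    Both arguments are assumed to be wikitext strings; pass "" as
    parent_text for a page's first revision.
    """
    parent_lines = parent_text.splitlines()
    revision_lines = revision_text.splitlines()
    added, removed = [], []
    # ndiff prefixes each line with "+ " (added), "- " (removed),
    # "  " (unchanged) or "? " (intraline hint, ignored here).
    for line in difflib.ndiff(parent_lines, revision_lines):
        if line.startswith("+ "):
            added.append(line[2:])
        elif line.startswith("- "):
            removed.append(line[2:])
    return added, removed
```

For spam analysis the added lines are usually the interesting part, since that is where newly inserted links would appear.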

Weekly updates:

  • The diffs of spambot revisions were ultimately retrieved through the API.
  • Exploring the dataset of spambots and the dataset of spambot revision diffs has revealed challenges for the initial goal of the project, cross-wiki spambot detection. In particular, few spambots made cross-wiki edits and few of their revisions are available.
    • Two meetings with T&S have been scheduled for next week to discuss these results.

Weekly updates:

  • Three meetings have been held (two with T&S staff and one with a steward) to identify why few spambots had cross-wiki edits: most spambot activity might be stored as hits in the logs of AbuseFilter or Spamblacklist.
  • These conversations also led to formalizing the problem of characterizing spambots as determining whether a URL/domain is related to spambot-driven activity.
  • The problem of missing data on deleted revisions has been reviewed with research engineering, and more examples will be analyzed to better identify how this content is handled in our databases.
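
Under the URL/domain formulation above, a natural first step is to pull the domains out of the text that a revision adds. The sketch below is an assumption about how that could look, not the method used in the project: the helper name `extract_domains` and the simple `http(s)` pattern are illustrative, and the pattern does not cover protocol-relative or bare-domain links.

```python
import re
from urllib.parse import urlsplit

# Illustrative pattern: an http(s) URL running until whitespace or a
# character that typically delimits links in wikitext.
URL_PATTERN = re.compile(r"https?://[^\s\[\]<>|{}]+", re.IGNORECASE)


def extract_domains(text):
    """Return the set of lowercased host names of http(s) URLs in text."""
    domains = set()
    for url in URL_PATTERN.findall(text):
        host = urlsplit(url).hostname
        if host:
            domains.add(host.lower())
    return domains
```

Applied to the added lines of each diff, this yields one set of domains per revision, which can then be aggregated per domain across wikis.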

Weekly updates:

  • Progress in Q1 has been reported on meta, including the literature review.
  • Examples of missing data from deleted revisions were provided to research engineering, and the investigation revealed that (some) deleted revision texts can be found in the Archive tables on MariaDB.
  • A new script has been developed to retrieve texts of deleted revisions from such tables.

Weekly updates:

  • A script has been launched to retrieve all diffs from revisions that have hit AbuseFilter rules about spamming or link addition.
  • This work was presented at the TTO'21 conference.

Weekly updates:

  • A script was launched to retrieve all diffs from revisions that hit global AbuseFilter rules about spamming or link addition (these filters are defined at metawiki).
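
As a rough sketch of how the hits of a given filter could be requested, the snippet below builds a query URL for the AbuseFilter log module of the MediaWiki Action API (`list=abuselog`). It is an assumption about the approach, not the script referred to above: the function name, the default endpoint, the filter identifier passed in, and the chosen `aflprop` fields are illustrative, and reading log entries may require rights that depend on the wiki and the filter. The snippet only builds the URL and sends no request.

```python
from urllib.parse import urlencode

# Assumed default endpoint; hits may need to be queried on each wiki's own API.
DEFAULT_API = "https://meta.wikimedia.org/w/api.php"


def build_abuselog_query(filter_id, limit=50, api_url=DEFAULT_API):
    """Build an Action API URL listing AbuseFilter log entries for one filter."""
    params = {
        "action": "query",
        "list": "abuselog",
        "aflfilter": filter_id,  # hypothetical filter identifier
        "afllimit": limit,
        "aflprop": "ids|user|title|action|result|timestamp|revid",
        "format": "json",
    }
    return api_url + "?" + urlencode(params)
```

Entries that carry a revision ID can then be passed to the diff step, while hits whose edit was disallowed have no saved revision to diff.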