
Build the spambot detection model
Closed, Resolved · Public

Description

Event Timeline

Pablo added subscribers: leila, diego.

Weekly updates:

  • Scheduled a call with stewards to review the first findings of a spambot detection model prototype.

Weekly updates:

  • Conversation with a steward to share the first findings of the spambot detection model prototype. They suggested reframing the link-spamming focus as an edit-quality focus.
  • Results will be shared with other stewards and T&S colleagues for an open discussion on next steps.

Weekly updates:

  • Conversation with T&S staff to share the first findings of the spambot detection model prototype. They suggested exploring specific, sophisticated forms of spamming.
  • Conversation with the Research Lab group to share the first findings of the prototype. They provided several resources.
  • Recorded a video to share with stewards so they can provide feedback during my imminent leave.
  • Scheduled a call with the Global Head of Trust and Safety to share findings and discuss next steps.
  • Added a README.md for continuity during my imminent leave.

Weekly updates:

  • Conversation with the Global Head of Trust and Safety to share findings. Two possible (non-exclusive) next steps are:
    • Collect and analyze a dataset of editors' first edits to quantify whether the "newbie external link" behaviour is also present among good-faith editors (see the sketch after this list).
    • Focus on sophisticated forms of spamming.
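
For illustration, a minimal sketch of how one might flag whether the text added by a first edit contains an external link; the regex and function name are hypothetical, not the task's actual implementation:

```python
import re

# Bare URLs and bracketed external links in wikitext, e.g.
# "https://example.com" or "[https://example.com label]".
# Internal wikilinks ([[...]]) deliberately do not match.
EXTERNAL_LINK_RE = re.compile(r"\[?\s*(?:https?|ftp)://\S+", re.IGNORECASE)

def first_edit_adds_external_link(added_text: str) -> bool:
    """Return True if the wikitext added by an edit contains an external link."""
    return bool(EXTERNAL_LINK_RE.search(added_text))

assert first_edit_adds_external_link("See [https://example.com my site]")
assert not first_edit_adds_external_link("See [[Main Page]] for details")
```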

Weekly updates:

  • Conversation with the Moderator Tools lead to share results and discuss next steps.
  • Launched a script to retrieve the first edits of editors registered since 2020 and analyze their edit type; the goal is to quantify how often first edits contain an external link (a retrieval sketch follows).
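
A sketch of the kind of query such a script might run against a MediaWiki database replica; the connection string is a placeholder, and the column names follow the public replica schema, which may differ from the actual setup:

```python
import pandas as pd
import sqlalchemy  # assumed connection to a MediaWiki replica

# Hypothetical connection string; the real analysis ran in WMF notebooks.
engine = sqlalchemy.create_engine("mysql+pymysql://user:pass@host/enwiki_p")

# One row per editor registered since 2020, with the timestamp of their
# first edit; table and column names follow the public MediaWiki schema.
FIRST_EDITS_SQL = """
SELECT a.actor_name         AS editor,
       MIN(r.rev_timestamp) AS first_edit_ts
FROM revision r
JOIN actor  a ON r.rev_actor  = a.actor_id
JOIN `user` u ON a.actor_user = u.user_id
WHERE u.user_registration >= '20200101000000'
GROUP BY a.actor_id, a.actor_name
"""

first_edits = pd.read_sql(FIRST_EDITS_SQL, engine)
```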

Weekly updates:

  • Adapted an existing notebook to create the dataset of first edits of editors registered since 2020.
  • Reviewed the literature on editing behaviors among new Wikipedians.

Weekly updates:

  • The notebook that collects editors' first edits failed on large wikis because of a memory issue (already reported in Research Weekly), so the remaining data for those wikis will be collected on a stats machine (see the chunked-processing sketch below).
  • Wrote a draft with the design details and results of the different machine learning models (to be converted to MediaWiki markup and then used to update Meta).
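
A minimal sketch of the chunked-collection workaround, assuming the hypothetical `engine` and `FIRST_EDITS_SQL` from the earlier sketch; streaming fixed-size chunks keeps peak memory bounded on large wikis:

```python
import pandas as pd

# Stream results in fixed-size chunks rather than loading everything at
# once, so large wikis do not exhaust notebook memory.
total = 0
for chunk in pd.read_sql(FIRST_EDITS_SQL, engine, chunksize=100_000):
    total += len(chunk)
    # per-chunk work (e.g. the external-link check) would go here

print(f"collected {total} first edits")
```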

Weekly updates:

  • Completed the dataset of first edits of editors registered since 2020 (~10% of editors inserted an external link in their first edit).
  • Working on feature selection and clustering (an illustrative sketch follows).
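
An illustrative sketch of the clustering step; the feature names, the DataFrame `first_edits_df`, and the choice of k are assumptions, not the task's actual pipeline:

```python
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# Hypothetical per-editor features; the task does not list the actual ones.
features = first_edits_df[["edit_size", "n_external_links",
                           "secs_since_registration"]].fillna(0)

# Standardise so no single feature dominates the distance metric.
X = StandardScaler().fit_transform(features)

# k=4 is illustrative; in practice it would be picked via elbow/silhouette.
first_edits_df["cluster"] = KMeans(n_clusters=4, n_init=10,
                                   random_state=0).fit_predict(X)
```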

Weekly updates:

  • Assessed the machine learning model on a namespace-balanced dataset to mitigate bias, since a large majority of deleted spambot revisions were in namespace 2 (User pages); a resampling sketch follows this list.
  • Updated Meta with all the machine learning results.
  • Call with the stewards to discuss ongoing and future work. They concluded that the best approach is to integrate this work into ORES, so I introduced them to @diego's proposal on ML-Based Models for Knowledge Integrity.
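
A sketch of one way to build the namespace-balanced dataset mentioned above: downsample every namespace to the size of the smallest one so the model cannot use namespace alone as a proxy for "spambot". The DataFrame layout and column name are assumptions:

```python
import pandas as pd

def balance_by_namespace(df: pd.DataFrame, seed: int = 0) -> pd.DataFrame:
    """Downsample every namespace to the size of the smallest one, so the
    model cannot learn 'namespace 2 implies spambot' as a shortcut.
    Assumes a `namespace` column; names here are illustrative."""
    n = df["namespace"].value_counts().min()
    return (df.groupby("namespace", group_keys=False)
              .apply(lambda g: g.sample(n=n, random_state=seed)))
```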