
[Open question] How to more effectively detect spambot accounts?
Open, MediumPublic

Description

Problem description
Problem: Finding efficient ways to detect cross-wiki spambot accounts.

Spambot accounts create one-off pages on small wikis and require a lot of steward time to identify and lock. They use a series of proxies and VPNs that make IP blocks impractical.

Possible solutions:

  • A better CAPTCHA might be able to prevent more account creation by spambots.
  • Improved abuse filters may help, though a truly global abuse-filter infrastructure is missing.
  • Anti-spam technology may be improved.

What information is missing?

  • How AbuseFilter works needs to be documented for this research.
  • What anti-spam technology do we use now?

Event Timeline

leila triaged this task as Medium priority.Nov 20 2019, 12:45 AM
leila created this task.

@leila, if this task is still open I would like to help.

In addition to CAPTCHAs, I see two possible ways to handle wiki spambots, depending on the scale of the issue.

Small scale
  • if the issue is specific to small wikis and the creation of one-off pages, then simpler statistical methods should be enough.
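As a minimal sketch of such a statistical heuristic (the field names and thresholds below are my assumptions, not an actual Wikimedia schema): flag brand-new accounts whose only activity is creating a single page stuffed with external links.

```python
from datetime import datetime, timedelta

def is_suspicious(account):
    """Crude heuristic: a freshly registered account whose only edit
    creates one page containing several external links is a spambot
    candidate.  Field names and thresholds are illustrative only."""
    age_at_first_edit = account["first_edit_time"] - account["registration_time"]
    return (
        account["edit_count"] == 1
        and account["pages_created"] == 1
        and account["external_links_added"] >= 3
        and age_at_first_edit < timedelta(minutes=10)
    )

# Example: an account that registered and immediately created a link-heavy page.
spambot = {
    "registration_time": datetime(2019, 11, 20, 0, 0),
    "first_edit_time": datetime(2019, 11, 20, 0, 2),
    "edit_count": 1,
    "pages_created": 1,
    "external_links_added": 5,
}
```

Such a rule would produce candidates for steward review rather than automatic locks, which keeps false positives cheap.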
Medium/Large scale
  • if there are pages or series of pages generated by spambots or groups of spambots, then a machine learning approach could be used
    • use [IP address, username, requested webpage URL, session identity, timestamp] information from web server logs to identify wiki usage patterns by spambots
    • gather and preprocess a feature dataset specific to Wikimedia and apply ML methods for spam detection:
    • example feature set from "Emotional Bots: Content-based Spammer Detection on Social Media":
      Emotional Bots Content-based Spammer Detection on Social Media.PNG (335×423 px, 34 KB)
    • add spam detection as a topic to the existing topic-detection models: https://github.com/wikimedia/drafttopic
    • for references on ML spambot-detection methods, see e.g. Botometer: https://github.com/IUNetSci/botometer-python
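Before reaching for full ML, the log-based idea above could be prototyped as a simple session-burst detector over the [IP, username, URL, session, timestamp] tuples from the web server logs (the tuple layout and thresholds here are assumptions for illustration):

```python
from collections import defaultdict
from datetime import datetime

def flag_spambot_sessions(logs, min_requests=3, max_span_s=60):
    """Flag sessions that fire many requests within a short time window,
    a crude statistical stand-in for an ML classifier.
    Each log entry is (ip, username, url, session_id, timestamp)."""
    by_session = defaultdict(list)
    for ip, user, url, session_id, ts in logs:
        by_session[session_id].append(ts)

    flagged = []
    for session_id, times in by_session.items():
        times.sort()
        span = (times[-1] - times[0]).total_seconds()
        # Many requests packed into a small window suggests automation.
        if len(times) >= min_requests and span <= max_span_s:
            flagged.append(session_id)
    return flagged

# Example: session "a" submits three pages in 20 seconds; "b" is a normal reader.
logs = [
    ("1.2.3.4", "SpamUser", "/w/index.php?action=submit", "a", datetime(2019, 11, 20, 0, 0, 0)),
    ("1.2.3.4", "SpamUser", "/w/index.php?action=submit", "a", datetime(2019, 11, 20, 0, 0, 10)),
    ("1.2.3.4", "SpamUser", "/w/index.php?action=submit", "a", datetime(2019, 11, 20, 0, 0, 20)),
    ("5.6.7.8", "Editor",   "/wiki/Foo",                  "b", datetime(2019, 11, 20, 0, 0, 0)),
]
```

The same per-session aggregates (request count, time span, fraction of page-creation URLs) could later serve directly as features for a trained classifier.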

Note:
Please let me know what you think. Do you have any examples of spam pages? Do you prefer a simple or a more robust solution?