
Extension to identify and delete spam pages (GSoC proposal)
Closed, Duplicate (Public)

Description

Personal information

  • Name: Vivek Ghaisas
  • Email: v.a.ghaisas@gmail.com
  • IRC (freenode) nick: polybuildr
  • Web pages: vghaisas.me (also has an inactive /blog), MediaWiki user page.
  • Location: I'll be in two places this summer.
    • India (UTC+5:30) (Place of residence and education)
    • Qatar (UTC+3:00)
  • Typical working hours: 6 - 8 hours a day, between 10 AM and 1 AM (local time)

Project

Synopsis

Build a MediaWiki extension that identifies existing spam on pages and presents an admin with options (such as 'delete', 'mark for review' or 'mark as not spam') for each page/edit that it lists. The extension would also remember pages/edits previously marked as 'not spam'.

Details

Unlike some existing extensions that rely on IPs/usernames (e.g. Extension:Nuke) or look for particular URLs (e.g. Extension:SpamBlacklist), the new extension would use broader, basic rules to search for possible spam pages/edits. These could include:

  • Disproportionately large number of links
  • Disproportionately little wikitext
  • Edit history: for example, a page created with a lot of content in one go, or a single very large edit.
  • Disproportionately large number of images (or other embedded files). Possible things to look at:
    • Large percentage of them have been newly uploaded
    • Could also list the images as likely spam
  • Other possibilities
    • Large number of misspelled words? (Would involve dictionaries with additions based on the wiki's content - proper nouns, etc.; would also be very slow)
    • Significant use of words/phrases from a blacklist (probably already exists in an extension)
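
The first two rules above (many links, little text) could be measured roughly as follows. This is only a minimal sketch in Python for illustration; an actual MediaWiki extension would be written in PHP. The function names, the regular expression and the thresholds are all assumptions and not part of the proposal.

```python
import re

# Hypothetical pattern: bare http(s) URLs, which also covers the URL part of
# bracketed external links such as [http://example.com label].
EXTERNAL_LINK = re.compile(r"https?://[^\s\]<>|]+", re.IGNORECASE)


def link_metrics(wikitext):
    """Return (link_count, links_per_100_words, link_char_share) for a page."""
    links = EXTERNAL_LINK.findall(wikitext)
    # Count only the words that remain once the URLs themselves are removed.
    remaining = EXTERNAL_LINK.sub(" ", wikitext)
    words = re.findall(r"\w+", remaining)
    link_chars = sum(len(link) for link in links)
    per_100_words = 100.0 * len(links) / max(len(words), 1)
    char_share = link_chars / max(len(wikitext), 1)
    return len(links), per_100_words, char_share


def looks_like_link_spam(wikitext, max_links_per_100_words=10.0, min_links=3):
    """Flag pages with many links relative to their text (assumed thresholds)."""
    count, per_100_words, _ = link_metrics(wikitext)
    return count >= min_links and per_100_words > max_links_per_100_words
```

A page is flagged only when it has both a minimum number of links and a high link-to-word ratio, so a short stub with a single reference link would not be reported. Sensible thresholds would presumably differ per wiki and should be configurable.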

Workflow

  1. Admin goes to extension's main special page and hits 'Find spam'.
  2. An AJAX request is made to one of the extension's pages, which does the actual searching. The main special page shows the progress of the search, possibly using a progress bar.
  3. Extension looks through wikipages' content, searching for cases of the above.
  4. Based on the metrics above, the extension generates a list of 'probably spam' pages/edits.
  5. Display the list with options for filtering. Results will be grouped by user/IP.
  6. Admin marks some pages for deletion, some others as 'not spam', some others as 'need review'. The admin can also mark actions to be taken for all edits by a particular user/IP.
  7. Extension performs appropriate actions.
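
Steps 5 and 6 (grouping results by user/IP and applying one action to everything from a given user) might look roughly like the sketch below. It is written in Python purely for illustration, and the dictionary keys `user`, `page` and `score` as well as both function names are hypothetical.

```python
from collections import defaultdict


def group_by_user(flagged_edits):
    """Group flagged edits by the user name or IP that made them.

    `flagged_edits` is an iterable of dicts with the hypothetical keys
    'user', 'page' and 'score'.
    """
    groups = defaultdict(list)
    for edit in flagged_edits:
        groups[edit["user"]].append(edit)
    # Users with the most flagged edits come first; ties sort by name.
    return sorted(groups.items(), key=lambda item: (-len(item[1]), item[0]))


def apply_to_user(decisions, grouped, user, action):
    """Record one action (e.g. 'delete') for every flagged edit by `user`."""
    for name, edits in grouped:
        if name == user:
            for edit in edits:
                decisions[edit["page"]] = action
    return decisions
```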

Points to keep in mind

  • Remember which pages have already been checked and don't recheck if not changed. Could be made configurable.
  • Try to ensure that checks do not take too much time. Could also provide admin with options for picking which metrics to use while performing a particular search.
  • Provide ways to edit dictionaries (if used) and word blacklist.
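
One possible way to avoid rechecking unchanged pages is to remember the last revision that was checked for each page. The Python sketch below is illustrative only: the class and method names are hypothetical, and a real extension would presumably persist this in a database table rather than in memory.

```python
class CheckedPages:
    """In-memory record of the last revision checked for each page."""

    def __init__(self):
        self._last_checked = {}  # page_id -> revision_id
        self._not_spam = set()   # (page_id, revision_id) pairs cleared by an admin

    def needs_check(self, page_id, revision_id):
        """True unless this exact revision was already checked or cleared."""
        if (page_id, revision_id) in self._not_spam:
            return False
        return self._last_checked.get(page_id) != revision_id

    def record_check(self, page_id, revision_id):
        """Remember that this revision of the page has been checked."""
        self._last_checked[page_id] = revision_id

    def mark_not_spam(self, page_id, revision_id):
        """Remember an admin's 'not spam' decision for this revision."""
        self._not_spam.add((page_id, revision_id))
```

Keying on the revision means that a page is looked at again as soon as it is edited, even if an earlier revision was marked as 'not spam'.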

Deliverables

The primary deliverable for this project is the extension itself. This can be broken down into a few components, though.

  • Minimum viable product (MVP)

    A crude prototype of the extension which can perform the basic function of searching for pages based on a single rule and then list out the matching pages.
  • Extensible rules/rule list

    Write the single rule checked by the MVP in a modular fashion, so that further rules/metrics can be added easily.
  • Complete extension

    Use the above two to build the complete extension.
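
As a rough illustration of what an extensible rule list could look like, the Python sketch below registers each rule under a name so that new rules can be added without touching the search code. Everything here (the `RULES` registry, the `rule` decorator and the two example rules with their thresholds) is hypothetical and only one of several possible designs.

```python
RULES = {}


def rule(name):
    """Decorator that registers a check function under `name`."""
    def register(func):
        RULES[name] = func
        return func
    return register


@rule("few_words")
def few_words(wikitext):
    # Hypothetical rule: fewer than 5 whitespace-separated words.
    return len(wikitext.split()) < 5


@rule("many_links")
def many_links(wikitext):
    # Hypothetical rule: more than 3 occurrences of "http://" or "https://".
    return wikitext.count("http://") + wikitext.count("https://") > 3


def run_rules(wikitext, enabled=None):
    """Return the sorted names of the enabled rules that match `wikitext`."""
    names = RULES if enabled is None else enabled
    return sorted(name for name in names if RULES[name](wikitext))
```

The optional `enabled` argument corresponds to letting the admin pick which metrics to use for a particular search.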

Timeline

  • April 27th to May 25th: Community bonding period. Find out more about the kind of issues 3rd party wikis have with spam and get the community's opinion on the same. Also ask for a Gerrit repo and a Labs instance.
  • Week 1 (May 25th - May 31st): Create extension's skeleton; decide extension's structure
  • Weeks 2, 3 (June 1st - June 14th): Work on minimum viable product
  • Week 4 (June 15th - June 21st): Finish MVP; ask for community review
  • Week 5 (June 22nd - June 28th): Reconsider extension structure and procedure based on review of MVP; write basic documentation of MVP; also mid-term evaluation
  • Weeks 6 - 10 (June 28th - July 26th): Add in other metrics for identifying spam; provide filtering options on list page; figure out what tests need to be written
  • Weeks 11, 12 (July 27th - August 9th): Add in AJAX request; write proper documentation; attempt to implement “other possible” metrics (as mentioned above)
  • Weeks 13, 14 (August 10th - August 20th): Wrap-up: minor bug fixes, write tests, review and edit documentation

Communication

As I did with my earlier patches, I will communicate with the community on #mediawiki-dev and also on the relevant Phabricator task. The source code will be pushed to Gerrit (possibly with a GitHub mirror) and code review will take place there. Discussions with the mentors will probably happen over email (or Conpherence, pending T91392).

About Me

  • Education: Computer Science and Engineering undergrad student at IIIT-H
  • Commitments during duration of the program: No formal commitments
  • Why do I want to do this?

    Until late last year, none of my work was for FOSS organizations. I contributed to personal projects or worked in small groups. But after starting work on MediaWiki in December last year, I discovered that it's actually a lot of fun to work with the open source community. Hundreds of people from all over the world working on a project, many as volunteers. That realization hit me pretty hard, and I really felt like I had to become part of this.

    As for why this particular project: Third party wikis commonly lack the manpower to actively patrol their wikis and keep away spam, thus becoming rather easy targets for wily spammers. If a new extension can help them with the herculean task of keeping away spam, then it should certainly be done.

Event Timeline

polybuildr claimed this task.
polybuildr raised the priority of this task to Normal.
polybuildr updated the task description.
polybuildr added subscribers: lucky, Florian, Ricordisamoa and 28 others.
polybuildr updated the task description. Mar 22 2015, 12:15 AM
polybuildr set Security to None.
jan added a comment. Mar 26 2015, 5:16 PM

First, your proposal is very nice and I would be happy to support you as a mentor. It is very nice that you already have some experience with MediaWiki development :-)

A short question:

> Typical working hours: 6 - 8 hours a day, between 10 AM and 1 AM.

Is this UTC or local time?

Best regards,
Jan

jan awarded a token. Mar 26 2015, 5:16 PM

Thanks, @jan. :)

As for:

> Typical working hours: 6 - 8 hours a day, between 10 AM and 1 AM.

Sorry for not clarifying that; I meant local time. I'll update the task description too.

polybuildr updated the task description. Mar 26 2015, 5:25 PM
lucky added a comment. Mar 27 2015, 4:36 PM

@jan
Sir, have you seen my proposal? Any comments on that, sir?
https://phabricator.wikimedia.org/T93480

@lucky: Jan is already CCed on the task that you linked so there is no need to ask in this task instead of the one that you linked. Thanks.

polybuildr updated the task description. May 10 2015, 11:46 AM