Extension to identify and delete spam pages
Closed, Resolved | Public

Description

Synopsis

Build a MediaWiki extension (as a GSoC project) that identifies existing spam pages and lists them for an admin, who can then 'delete' each listed page or 'mark as not spam' (among other possible options, such as 'mark for review').

Primary mentor: @Yaron_Koren
Co-mentor: @jan

Details

Unlike some existing extensions, which use IPs/usernames for mass deletion (e.g. Extension:Nuke) or look for particular URLs during page creation (e.g. Extension:SpamBlacklist), the new extension would use some broader, basic rules to search for existing pages that are possibly spam. These rules could include:

  • Disproportionately large number of external links
  • Disproportionately little wikitext
  • Edit history: A fully formed page created in one go
  • Disproportionately large number of images (or other embedded files). Possible things to look at:
    • Large percentage of them have been newly uploaded
    • Could also list the images as likely spam
  • Other possibilities
    • Large number of misspelled words? (Would involve dictionaries with additions based on the wiki's content - proper nouns, etc.; would also be very slow)
    • Significant use of words/phrases from a blacklist (probably already exists in an extension)
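
As a rough illustration of the first two rules above (many external links, little other wikitext), here is a minimal sketch in Python. It is not the extension's code: the URL pattern, the function names, and the `max_links_per_100_chars` threshold are all assumptions made up for this example.

```python
import re

# Assumption: any bare or bracketed http(s) URL counts as one external link.
EXTERNAL_LINK_RE = re.compile(r"https?://[^\s\]\|<>\"]+", re.IGNORECASE)


def external_link_count(wikitext: str) -> int:
    """Count http(s) URLs in the raw wikitext."""
    return len(EXTERNAL_LINK_RE.findall(wikitext))


def non_link_text_length(wikitext: str) -> int:
    """Length of the wikitext once URLs and whitespace are removed."""
    without_links = EXTERNAL_LINK_RE.sub("", wikitext)
    return len(re.sub(r"\s+", "", without_links))


def looks_like_link_spam(wikitext: str, max_links_per_100_chars: float = 1.0) -> bool:
    """Hypothetical rule: flag pages with many links relative to other text."""
    links = external_link_count(wikitext)
    if links == 0:
        return False
    # Guard against division by zero for pages that are nothing but links.
    text_len = max(non_link_text_length(wikitext), 1)
    return links / text_len * 100 > max_links_per_100_chars
```

A real checker would probably need a threshold tuned per wiki, since what counts as "disproportionate" depends on the wiki's normal content.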

Minimum Viable Product

A crude prototype of the extension which can perform the basic function of searching for pages based on a single rule and then list out the matching pages.
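
The shape of such a prototype can be sketched in a few lines of Python: take one rule, apply it to every page, and list the titles that match. The `find_matching_pages` and `too_little_text` names, the in-memory `pages` mapping, and the 50-character cutoff are hypothetical and stand in for whatever the extension actually does.

```python
from typing import Callable, Dict, List


def too_little_text(wikitext: str, min_chars: int = 50) -> bool:
    """Hypothetical single rule: the page has very little wikitext."""
    return len(wikitext.strip()) < min_chars


def find_matching_pages(pages: Dict[str, str], rule: Callable[[str], bool]) -> List[str]:
    """Return the titles of all pages whose wikitext satisfies the rule, sorted."""
    return sorted(title for title, text in pages.items() if rule(text))


if __name__ == "__main__":
    # Assumption: page titles mapped to raw wikitext, loaded elsewhere.
    pages = {
        "Main Page": "Welcome to the wiki. " * 10,
        "Cheap pills": "http://spam.example",
    }
    for title in find_matching_pages(pages, too_little_text):
        print(title)
```

Passing the rule in as a function keeps the listing code unchanged when further rules are added later.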

Timeline from Original Proposal

April 27th to May 25th: Community bonding period. Find out more about the kinds of issues third-party wikis have with spam and get the community's opinion on them. Also ask for a Gerrit repo and a Labs instance.
Week 1 (May 25th - May 31st): Create extension's skeleton; decide extension's structure
Weeks 2, 3 (June 1st - June 14th): Work on minimum viable product
Week 4 (June 15th - June 21st): Finish MVP; ask for community review
Week 5 (June 22nd - June 28th): Reconsider extension structure and procedure based on review of MVP; write basic documentation of MVP; also mid-term evaluation
Weeks 6 - 10 (June 28th - July 26th): Add in other metrics for identifying spam; provide filtering options on list page; figure out what tests need to be written
Weeks 11, 12 (July 27th - August 9th): Add in AJAX request; write proper documentation; attempt to implement “other possible” metrics (as mentioned above)
Weeks 13, 14 (August 10th - August 20th): Wrap-up: minor bug fixes, write tests, review and edit documentation

Event Timeline

@jan, you're welcome to be a mentor on any other project that interests you here: https://phabricator.wikimedia.org/project/board/1042/ :)

Hi,

I am Pakshal H Dhelaria, pursuing my 3rd year of engineering at PES Institute of Technology, Bangalore.

While I was going through the GSoC ideas home page, I came across this idea of 'Extension to identify and delete spam pages'.
I am interested in building a prototype for the same.

From what I understand from the idea description, the features would include:
Identifying spam pages using a machine learning algorithm.
Deleting them.

A few questions that I have:
What are features to be considered while identifying the spam page?
What is the expected efficiency?

Can you please help me get started on this idea?

Thanks,
Pakshal

What are features to be considered while identifying the spam page?

Is that covered by the initial description of this task above, "use some logic to try to figure out which ones were spam pages"? If not, could you elaborate?

What is the expected efficiency?

What kind of efficiency measurement do you have in mind?

@Pakshal94 - and anyone else who might be interested in doing this project: I've already been contacted by some very strong candidates; so if you haven't already talked to me via email, and/or already submitted a patch for one of the micro-tasks, your chance of getting chosen for this project is unfortunately quite low.

In other words, if you haven't communicated with me yet, please find another project. Sorry about that.

Who is going to work on this project?

We're holding an IRC meeting on March 25, at 1700 UTC, for prospective GSoC and Outreachy participants with Wikimedia, on the #wikimedia-office channel. Do join us!

Hello! The IRC meeting tomorrow has been shifted to the #wikimedia-ect channel. Looking forward to seeing you there. :)

Change 214303 had a related patch set uploaded (by Polybuildr):
[WIP] Initial work on SmiteSpam extension

https://gerrit.wikimedia.org/r/214303

Change 214303 merged by Polybuildr:
Initial work on SmiteSpam extension

https://gerrit.wikimedia.org/r/214303

Given the variance in 'spam' pages, it seems a bit short-sighted to hardcode 'checkers' into the extension.
I'd like to see it integrated with an AI-based service such as those provided by https://github.com/wiki-ai.

@Ricordisamoa - thanks for the suggestion. It would be neat to look into AI stuff; I don't know if there will be time during GSoC to do it, but we'll see. And I wouldn't call it short-sighted to hardcode the spam-checking code; I'd call it pragmatic.

I wouldn't call it short-sighted to hardcode the spam-checking code; I'd call it pragmatic.

Sorry if it sounded offensive. I've seen many GSoC projects end up in dust for being 'pragmatic' and not caring about the long term.
Let's avoid reinventing the wheel when we can have rockets...

Hello!

The end of GSoC is fast approaching. 17 August is the "Suggested pencils down" deadline and 21 August is the "Firm pencils down" deadline. You are expected not to dive into new features that might take longer than two weeks to complete, and instead to work on polishing up your project, testing thoroughly and getting your code merged into the main branch. I hope this project is almost complete so you can merge it and make it available to everyone as quickly as possible. :)

A few questions (for both mentors and student):

  • Are you confident in completing the project on time?
  • By when do you think you can merge the code, if at all?
  • Are there any major blockers or important missing features?

We are looking for projects which are (nearly) complete to feature on our post on Wikimedia and Google OSPO's blogs (for example: http://google-opensource.blogspot.in/2015/02/google-summer-of-code-wrap-up-processing.html). If you're interested in getting yours up there, hurry up and get this finished!

The hard deadline on getting code merged is September. See T101393: Goal: All completed GSoC and Outreachy projects have code merged and deployed by September for details.

We'll be asking the students to demo their projects towards the end of the program as well.

Good luck!

Hi Niharika,

All the real functionality for this project is already created; what's left now is mostly improving the user interface. You can see Vivek's code here:

http://git.wikimedia.org/tree/mediawiki%2Fextensions%2FSmiteSpam

Awesome. @polybuildr, could you write up a short paragraph about the project so we can include it in the blog post? Don't forget to add links, wherever required. Thanks!

@NiharikaKohli, sure! Where do I put the write-up and by when?

Hi, I have associated two blocked-by tasks with this project.

For the student:

  1. Please go through the checklist in the end-term evaluation and fill out the fields which require any links. The checkboxes are for the mentor(s) only. Adding information on the past projects page is your task.
  2. Ensure that you have completed all the items listed in the end-term evaluation task. If there's a strong reason about why a particular item was not completed, please comment on the task and we shall look into it.
  3. Wrap-up report is mandatory and so is a demo-able link to the project (either in production or in a demo server).
  4. If you want your project to be featured in the blogpost on the Google OSPO blog, kindly comment back with a short, catchy description of the project along with a screenshot.

@NiharikaKohli, I'd like to wait a little longer to resolve this, probably a week or so. There are a few tasks remaining on MediaWiki-extensions-SmiteSpam that need to be completed, after which I'll announce this extension on mediawiki-l (as per a discussion with Yaron and Jan) and mark this task as resolved too. Does that seem alright?

And finally, version 0.1.0 was released in 902eb3142. :D

I have my midterm exams right now, but in the next couple of days I'll send out a mail on the mediawiki-l mailing list announcing the extension.

Thanks a lot, @Yaron_Koren and @jan for all the help! :D

What are the next steps to assure that this extension lives long, with real users and maintenance?

Should this extension be proposed to be deployed in Wikimedia projects?

This seems like three separate questions, but my answer to the last one is: I'm not aware of any Wikimedia wikis that have a lot of spam pages, but if there are, then yes, definitely - SmiteSpam is the best tool (maybe the only tool) for the job.

Regarding users: I sent out a mail to mediawiki-l announcing the extension. Hopefully wiki admins who have this problem on their wikis will make use of the extension. At the moment, WikiApiary shows 2 wikis using the extension.

Regarding maintenance: Personally, I intend to continue maintaining this extension, so I'm hoping that the extension will not die because of a lack of maintenance. It would be great if others contribute - one contributor new to MediaWiki already fixed a good first task bug in the extension at T108543 - hopefully there will be more in the future.

SmiteSpam wasn't intended for Wikimedia wikis as per the original plan, simply because administration there seems to be going well - at least it doesn't look to me like any are infested with spam. However, as Yaron said, if there's a wiki that is, then yes, SmiteSpam should be proposed as a solution.

Thank you. I was asking about your plans, without any intention of suggesting anything. It's all clear now.

MediaWiki-extension-requests is an appropriate tag here: such an extension was missing and you developed it. Thank you! :)