There are quite a few MediaWiki extensions to prevent spam, and some extensions that let you delete pages //en masse//. What MediaWiki doesn't have yet is a capability to deal well with spam that's already in place on the wiki. The Nuke extension lets you do a mass deletion of all pages created by a single user or IP address, but that's not too helpful, because spammers tend to switch quickly from one user/IP address to another, perhaps to get around such tools. (Also, Nuke only works on recent changes and has little filtering capability; see bugs T33858, T56208, T68447.) This planned extension would instead go through all the pages in the wiki and use some logic to try to figure out which ones are spam pages; it would then display an interface that lets an administrator delete those pages, with a checkbox for each one so admins can toggle its deletion.

Identifying the spam pages should actually not be that difficult to do - from my experience, spam pages tend to be rather different from real wiki pages in content (almost complete lack of wikitext, other than external URLs), page history (created fully formed in a single edit), and so on.

### Synopsis

Building a MediaWiki extension (as a GSoC project) that can identify existing spam pages and then present the pages it finds to an admin, with options to 'delete' or 'mark as not spam' each one (among other possible options, such as 'mark for review').
Primary mentor: @Yaron_Koren
Unlike some existing extensions, which use IPs/usernames for mass deletion (e.g. Extension:Nuke) or look for particular URLs during page creation (e.g. Extension:SpamBlacklist), the new extension would use some broader, basic rules to search for **currently existing possible spam pages**, which could include:
- Disproportionately large number of external links
- Disproportionately little wikitext
- Edit history: A fully formed page created in one go
- Disproportionately large number of images (or other embedded files). Possible things to look at:
  - A large percentage of them have been newly uploaded
  - Could also list the images themselves as likely spam
- Other possibilities:
  - Large number of misspelled words? (Would involve dictionaries with additions based on the wiki's content - proper nouns, etc.; would also be very slow)
  - Significant use of words/phrases from a blacklist (probably already exists in an extension)
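As a rough illustration of how the first three rules could be combined, here is a hypothetical scoring sketch. The function name, regexes, weights, and thresholds are all assumptions for illustration, not part of the proposal; a real extension would run server-side in PHP against the wiki's database rather than on raw wikitext strings.

```python
import re

def spam_score(wikitext: str, num_revisions: int) -> float:
    """Score how spam-like a page looks, per the rules above.

    Hypothetical heuristic: higher score = more spam-like.
    Thresholds and weights are illustrative, not tuned.
    """
    score = 0.0
    words = wikitext.split()
    if not words:
        return score

    # Rule 1: disproportionately many external links.
    external_links = re.findall(r'https?://\S+', wikitext)
    if len(external_links) / len(words) > 0.05:
        score += 1.0

    # Rule 2: disproportionately little wikitext markup
    # (internal links, templates, headings, list items).
    markup = re.findall(r'\[\[|\{\{|^=+.*=+$|^[*#]', wikitext, re.MULTILINE)
    if len(markup) / len(words) < 0.01:
        score += 1.0

    # Rule 3: page created fully formed in a single edit.
    if num_revisions == 1:
        score += 1.0

    return score
```

Pages scoring above some cutoff would then be shown in the admin-facing list for review.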
### Minimum Viable Product
A crude prototype of the extension which can perform the basic function of searching for pages based on a single rule and then list out the matching pages.
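To make the MVP concrete, a minimal sketch of the single-rule search might look like the following. This assumes, purely for illustration, that page texts have already been fetched into memory; the function name and threshold are hypothetical, and the real prototype would be a PHP special page querying the wiki's database.

```python
import re

# Hypothetical single rule: flag pages whose external-link count
# exceeds a fixed fraction of their word count.
LINK_RATIO_THRESHOLD = 0.05

def find_suspect_pages(pages: dict) -> list:
    """Return the titles of pages matching the single rule.

    `pages` maps page title -> raw wikitext (assumed already fetched).
    """
    suspects = []
    for title, text in pages.items():
        words = text.split()
        if not words:
            continue
        links = re.findall(r'https?://\S+', text)
        if len(links) / len(words) > LINK_RATIO_THRESHOLD:
            suspects.append(title)
    return sorted(suspects)
```

The returned titles are exactly what the MVP would list out for the admin.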
### Timeline from Original Proposal
|April 27th to May 25th|Community bonding period. Find out more about the kinds of issues third-party wikis have with spam and get the community's opinion on them. Also ask for a Gerrit repo and a Labs instance.
|Week 1 (May 25th - May 31st)|Create extension's skeleton; decide extension's structure
|Weeks 2, 3 (June 1st - June 14th)|Work on minimum viable product
|Week 4 (June 15th - June 21st)|Finish MVP; ask for community review
|Week 5 (June 22nd - June 28th)|Reconsider extension structure and procedure based on review of MVP; write basic documentation of MVP; also mid-term evaluation
|Weeks 11, 12 (July 27th - August 9th)|Add in AJAX requests; write proper documentation; attempt to implement “other possible” metrics (as mentioned above)
|Weeks 13, 14 (August 10th - August 20th)|Wrap-up: minor bug fixes, write tests, review and edit documentation

Other mentors: (optional, Phabricator username)
Estimated project time for a senior contributor: 2-3 weeks
Microtasks: T90637, T91092, T91222