Synopsis
Building a MediaWiki extension (as a GSoC project) that identifies existing spam pages and lists them for an admin, who can then 'delete' each listed page or 'mark as not spam' (among other possible options, such as 'mark for review').
Primary mentor: @Yaron_Koren
Co-mentor: @jan
Details
Unlike existing extensions that use IPs/usernames for mass deletion (e.g. Extension:Nuke) or look for particular URLs during page creation (e.g. Extension:SpamBlacklist), the new extension would apply broader, basic rules to search pages that already exist for possible spam. These rules could include:
- Disproportionately large number of external links
- Disproportionately little wikitext
- Edit history: A fully formed page created in one go
- Disproportionately large number of images (or other embedded files). Possible things to look at:
  - Whether a large percentage of them have been newly uploaded
  - The images themselves could also be listed as likely spam
- Other possible metrics:
  - Large number of misspelled words? (Would involve dictionaries with additions based on the wiki's content, such as proper nouns; would also be very slow)
  - Significant use of words/phrases from a blacklist (probably already exists in an extension)
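The first few rules above can be sketched roughly as follows. This is only an illustration in Python, not the extension's actual code (which would be PHP inside MediaWiki). The names `PageSignals`, `compute_signals` and `looks_like_spam`, the regular expressions, and the threshold of 10 links per 100 words are all assumptions made for the example.

```python
import re
from dataclasses import dataclass

# Matches bare or bracketed external URLs in wikitext (illustrative, not exhaustive).
EXTERNAL_LINK_RE = re.compile(r'https?://[^\s\]|<>"]+', re.IGNORECASE)
# Matches the start of an embedded file, e.g. [[File:Foo.png]] or [[Image:Foo.png|thumb]].
EMBEDDED_FILE_RE = re.compile(r"\[\[\s*(?:File|Image)\s*:", re.IGNORECASE)


@dataclass
class PageSignals:
    """Raw counts for one page; all field names are hypothetical."""
    external_links: int
    embedded_files: int
    word_count: int
    revision_count: int

    @property
    def links_per_100_words(self) -> float:
        # max(..., 1) avoids division by zero for pages that contain only links.
        return 100.0 * self.external_links / max(self.word_count, 1)


def compute_signals(wikitext: str, revision_count: int) -> PageSignals:
    links = EXTERNAL_LINK_RE.findall(wikitext)
    files = EMBEDDED_FILE_RE.findall(wikitext)
    # Strip URLs before counting words so the links do not inflate the word count.
    text_without_links = EXTERNAL_LINK_RE.sub(" ", wikitext)
    words = re.findall(r"[A-Za-z0-9']+", text_without_links)
    return PageSignals(len(links), len(files), len(words), revision_count)


def looks_like_spam(signals: PageSignals, max_links_per_100_words: float = 10.0) -> bool:
    # Assumed rule: link-heavy relative to its text AND created in a single edit.
    link_heavy = signals.links_per_100_words > max_links_per_100_words
    created_in_one_go = signals.revision_count == 1
    return link_heavy and created_in_one_go
```

For example, `looks_like_spam(compute_signals(text, revision_count=1))` would flag a one-revision page whose text is mostly links. Any real threshold would need tuning against actual wiki content.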
Minimum Viable Product
A crude prototype of the extension that performs the basic function: searching for pages that match a single rule and listing the matches.
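In outline, the MVP amounts to applying one rule to every page and listing the titles that match. The sketch below assumes pages are available as an in-memory mapping of title to wikitext; a real extension would more likely query the wiki's database (MediaWiki keeps an `externallinks` table, for instance). The function names and the threshold of 5 links are hypothetical.

```python
import re
from typing import Callable, Dict, List

# A rule takes a page's wikitext and says whether the page looks suspicious.
Rule = Callable[[str], bool]


def has_many_external_links(wikitext: str, threshold: int = 5) -> bool:
    # Assumed single rule for the prototype: at least `threshold` external URLs.
    return len(re.findall(r"https?://", wikitext, flags=re.IGNORECASE)) >= threshold


def find_suspect_pages(pages: Dict[str, str], rule: Rule) -> List[str]:
    # Return the titles of all pages matching the rule, sorted for a stable listing.
    return sorted(title for title, text in pages.items() if rule(text))


if __name__ == "__main__":
    pages = {
        "Main Page": "Welcome to the wiki.",
        "Cheap Watches": "http://a.example " * 6,
        "Help": "See https://www.mediawiki.org",
    }
    for title in find_suspect_pages(pages, has_many_external_links):
        print(title)
```

Keeping the rule as a plain function passed to the search makes it straightforward to swap in other metrics later without changing the listing code.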
Timeline from Original Proposal
April 27th to May 25th | Community bonding period. Find out more about the kinds of issues third-party wikis have with spam and get the community's opinion on them. Also ask for a Gerrit repo and a Labs instance. |
Week 1 (May 25th - May 31st) | Create extension's skeleton; decide extension's structure |
Weeks 2, 3 (June 1st - June 14th) | Work on minimum viable product |
Week 4 (June 15th - June 21st) | Finish MVP; ask for community review |
Week 5 (June 22nd - June 28th) | Reconsider extension structure and procedure based on review of MVP; write basic documentation of MVP; also mid-term evaluation |
Weeks 6 - 10 (June 29th - July 26th) | Add in other metrics for identifying spam; provide filtering options on list page; figure out what tests need to be written |
Weeks 11, 12 (July 27th - August 9th) | Add in AJAX request; write proper documentation; attempt to implement “other possible” metrics (as mentioned above) |
Weeks 13, 14 (August 10th - August 20th) | Wrap-up: minor bug fixes, write tests, review and edit documentation |