- Name: Vivek Ghaisas
- Email: firstname.lastname@example.org
- IRC (freenode) nick: polybuildr
- Web pages: vghaisas.me (also has an inactive /blog), MediaWiki user page.
- Location: I'll be in two places this summer.
- India (UTC+5:30) (Place of residence and education)
- Qatar (UTC+3:00)
- Typical working hours: 6 - 8 hours a day, between 10 AM and 1 AM (local time)
Build a MediaWiki extension that identifies existing spam on pages and then lets an admin 'delete', 'mark for review' or 'mark as not spam' (among other possible options) the pages/edits that the extension lists. The extension would also remember pages/edits previously marked as 'not spam'.
Unlike existing extensions that rely on IPs/usernames (e.g. Extension:Nuke) or look for particular URLs (e.g. Extension:SpamBlacklist), the new extension would use broader basic rules to search for possible spam pages/edits. These could include:
- Disproportionately large number of links
- Disproportionately little wikitext
- Edit history: A page with a lot of content getting created in one go, for example, or a very large edit being made.
- Disproportionately large number of images (or other embedded files). Possible things to look at:
- Large percentage of them have been newly uploaded
- Could also list the images as likely spam
- Other possibilities
- Large number of misspelled words? (Would involve dictionaries with additions based on the wiki's content - proper nouns, etc.; would also be very slow)
- Significant use of words/phrases from a blacklist (probably already exists in an extension)
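As a rough illustration of the first two rules in the list above (many links, little wikitext), here is a minimal Python sketch. The extension itself would be written in PHP; the function names, the regular expressions and the 0.2 threshold are all assumptions made for illustration and would need tuning against real wiki content.

```python
import re

# Matches external links: bare URLs or the URL part of [http://... label].
EXTERNAL_LINK = re.compile(r"https?://[^\s\]|]+")
# Matches internal wiki links such as [[Page]] or [[Page|label]].
INTERNAL_LINK = re.compile(r"\[\[[^\]]+\]\]")


def link_density(wikitext: str) -> float:
    """Return external links per word of non-link text (hypothetical metric)."""
    links = EXTERNAL_LINK.findall(wikitext)
    # Remove links before counting words so URLs do not inflate the word count.
    stripped = INTERNAL_LINK.sub(" ", EXTERNAL_LINK.sub(" ", wikitext))
    words = re.findall(r"\w+", stripped)
    # max(..., 1) avoids division by zero on pages that contain only links.
    return len(links) / max(len(words), 1)


def looks_like_link_spam(wikitext: str, threshold: float = 0.2) -> bool:
    """Flag text whose link density exceeds an assumed, tunable threshold."""
    return link_density(wikitext) > threshold
```

A page consisting of three URLs and two words scores 1.5 and is flagged, while an ordinary paragraph with a single external link stays well under the threshold.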
- Admin goes to extension's main special page and hits 'Find spam'.
- An AJAX request is made to one of the extension's pages, which does the actual searching. The main special page shows the progress of the search, possibly using a progress bar.
- Extension looks through wikipages' content, searching for cases of the above.
- Based on the metrics above, the extension generates a list of 'probably spam' pages/edits.
- Display the list with options for filtering. Results will be grouped by user/IP.
- Admin marks some pages for deletion, some others as 'not spam', some others as 'need review'. The admin can also mark actions to be taken for all edits by a particular user/IP.
- Extension performs appropriate actions.
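The grouping and marking steps of this workflow could look roughly like the Python sketch below. It is not the extension's actual design: the `Suspect` and `Action` types, the function names, and the choice that a per-page decision overrides a per-user decision are all assumptions made for illustration.

```python
from collections import defaultdict
from dataclasses import dataclass
from enum import Enum


class Action(Enum):
    DELETE = "delete"
    NOT_SPAM = "not spam"
    NEEDS_REVIEW = "needs review"


@dataclass(frozen=True)
class Suspect:
    page: str
    user: str  # username or IP address of the editor
    score: float  # combined spam score from the metrics


def group_by_user(suspects):
    """Group suspects by editor, highest score first within each group."""
    groups = defaultdict(list)
    for s in suspects:
        groups[s.user].append(s)
    return {u: sorted(items, key=lambda s: -s.score) for u, items in groups.items()}


def resolve_actions(suspects, per_page, per_user):
    """Combine admin decisions; an explicit per-page choice wins (assumption)."""
    decisions = {}
    for s in suspects:
        action = per_page.get(s.page, per_user.get(s.user))
        if action is not None:
            decisions[s.page] = action
    return decisions
```

Pages with no decision at all are simply left out of the result, so they would show up again in the next search.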
Points to keep in mind
- Remember which pages have already been checked and don't recheck them unless they have changed. This could be made configurable.
- Try to ensure that checks do not take too much time. Could also provide admin with options for picking which metrics to use while performing a particular search.
- Provide ways to edit dictionaries (if used) and word blacklist.
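One possible way to skip unchanged pages is to store the revision ID that was last checked for each page and compare it with the page's latest revision. The Python sketch below assumes an in-memory dictionary and hypothetical names (`CheckedPages`, `recheck_unchanged`); a real extension would more likely keep this in a database table.

```python
class CheckedPages:
    """Remembers the revision ID last checked for each page."""

    def __init__(self, recheck_unchanged: bool = False):
        # Configurable: set to True to force a full recheck every time.
        self.recheck_unchanged = recheck_unchanged
        self._last_checked = {}  # page ID -> revision ID

    def needs_check(self, page_id: int, latest_rev_id: int) -> bool:
        """Return True if the page is new or has changed since the last check."""
        if self.recheck_unchanged:
            return True
        return self._last_checked.get(page_id) != latest_rev_id

    def mark_checked(self, page_id: int, rev_id: int) -> None:
        """Record that this revision of the page has been checked."""
        self._last_checked[page_id] = rev_id
```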
The primary deliverable for this project is the extension itself. This can be broken down into a few components, though.
- Minimum viable product (MVP)
A crude prototype of the extension which can perform the basic function of searching for pages based on a single rule and then list the matching pages.
- Extendable rules/rule list
Write the single rule checked by the MVP in a modular fashion, so that further rules/metrics can be added easily.
- Complete extension
Use the above two to build the complete extension.
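To make the "extendable rules" component more concrete, here is a minimal Python sketch of one way rules could plug into a common interface. The class names (`SpamRule`, `FewWordsRule`, `RuleSet`), the 0 to 1 score range and the `min_words` parameter are assumptions for illustration only, not a settled design.

```python
from abc import ABC, abstractmethod


class SpamRule(ABC):
    """Common interface that every rule/metric would implement."""

    name: str

    @abstractmethod
    def score(self, wikitext: str) -> float:
        """Return a value in [0, 1]; higher means more spam-like."""


class FewWordsRule(SpamRule):
    """Example rule: pages with very little text score higher."""

    name = "few-words"

    def __init__(self, min_words: int = 20):
        self.min_words = min_words

    def score(self, wikitext: str) -> float:
        words = len(wikitext.split())
        return max(0.0, 1.0 - words / self.min_words)


class RuleSet:
    """Holds the registered rules and runs all of them on a page."""

    def __init__(self):
        self._rules = []

    def register(self, rule: SpamRule) -> None:
        self._rules.append(rule)

    def evaluate(self, wikitext: str) -> dict:
        return {rule.name: rule.score(wikitext) for rule in self._rules}
```

With this shape, the MVP would register a single rule, and later metrics would be added by writing another `SpamRule` subclass and registering it, without touching the search code.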
|April 27th to May 25th||Community bonding period. Find out more about the kind of issues 3rd party wikis have with spam and get community's opinion on the same. Also ask for a Gerrit repo and a Labs instance.|
|Week 1 (May 25th - May 31st)||Create extension's skeleton; decide extension's structure|
|Weeks 2, 3 (June 1st - June 14th)||Work on minimum viable product|
|Week 4 (June 15th - June 21st)||Finish MVP; ask for community review|
|Week 5 (June 22nd - June 28th)||Reconsider extension structure and procedure based on review of MVP; write basic documentation of MVP; also mid-term evaluation|
|Weeks 6 - 10 (June 29th - July 26th)||Add in other metrics for identifying spam; provide filtering options on list page; figure out what tests need to be written|
|Weeks 11, 12 (July 27th - August 9th)||Add in AJAX request; write proper documentation; attempt to implement “other possible” metrics (as mentioned above)|
|Weeks 13, 14 (August 10th - August 20th)||Wrap-up: minor bug fixes, write tests, review and edit documentation|
As I did with my earlier patches, I will communicate with the community on #mediawiki-dev and on the relevant Phabricator task. The source code will be pushed to Gerrit (possibly with a GitHub mirror) and code review will take place there. Discussions with the mentors will probably happen over email (or Conpherence, pending T91392).
- Education: Computer Science and Engineering undergrad student at IIIT-H
- Commitments during the program: No formal commitments
- Why do I want to do this?
Until late last year, none of my work was for FOSS organizations. I contributed to personal projects or worked in small groups. But after starting work on MediaWiki in December last year, I discovered that it's actually a lot of fun to work with the open source community: hundreds of people from all over the world working on one project, many of them as volunteers. That realization hit me pretty hard, and I really felt like I had to become part of this.
As for why this particular project: Third party wikis commonly lack the manpower to actively patrol their wikis and keep away spam, thus becoming rather easy targets for wily spammers. If a new extension can help them with the herculean task of keeping away spam, then it should certainly be done.