This is a GSoC project proposal for the idea: Extension to identify and delete spam pages
Name: Arindam Padhy
IRC nick: d3m3nt3r
Time Zone: UTC+5:30
Typical working hours: 9PM to 3AM before 23rd April, 3PM to 6PM after 23rd April (Indian Standard Time)
There are quite a few MediaWiki extensions to prevent spam, and some extensions that let you delete pages en masse.
What MediaWiki doesn't have yet is a capability to deal well with spam that's already in place on the wiki.
The Nuke extension lets you do a mass deletion on all pages created by a single user or IP address, but that's not too helpful because spammers tend to switch quickly from one user/IP address to another,
perhaps to get around such tools.
At present the spam detection system is not very effective, since it does not catch certain kinds of spam. The addition of rel="nofollow" has led to a slight reduction in spam, but there are still many
types of spam to deal with. My project will try to deal with them all.
My solution is to build the extension in three phases:
1. Detection of the different types of spam
2. Deletion of the spam
3. A user/client spam reporting system
The kinds of spam I plan to detect include (a rough scoring sketch follows this list):
1. Redirection to a different website before the original page loads
2. A large number of images concentrated in one particular part of the page
3. Very little actual wikitext
4. Plain unwanted text
5. A large number of links
6. A large number of recent changes to the page source
7. Grammatical errors on the page, including misspelled words
8. The same phrases used repeatedly
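To make this concrete, here is a minimal PHP sketch of how a few of these signals could be turned into a spam score; the function name, regexes, thresholds and weights are my own placeholders, not part of any existing MediaWiki API.

<syntaxhighlight lang="php">
<?php
// Rough spam-scoring sketch: checks a few of the signals listed above
// (external links, image tags, repeated phrases, very short pages) in
// raw wikitext. Thresholds and weights are placeholders, not tuned values.
function scoreWikitext( $wikitext ) {
    $score = 0;

    // Signal 5: a large number of links.
    $linkCount = preg_match_all( '/https?:\/\/\S+/i', $wikitext );
    if ( $linkCount > 20 ) {
        $score += 2;
    }

    // Signal 2: many image inclusions.
    if ( preg_match_all( '/\[\[(File|Image):/i', $wikitext ) > 15 ) {
        $score += 1;
    }

    // Signal 8: the same word or phrase repeated continuously.
    $words = preg_split( '/\s+/', strtolower( $wikitext ), -1, PREG_SPLIT_NO_EMPTY );
    $counts = array_count_values( $words );
    if ( $counts && max( $counts ) > 50 ) {
        $score += 2;
    }

    // Signal 3: very little actual wikitext, but links are present.
    if ( strlen( trim( $wikitext ) ) < 200 && $linkCount > 0 ) {
        $score += 1;
    }

    return $score; // higher score = more likely spam
}

// Example: pages scoring 3 or more would be flagged for review or deletion.
$wikitext = "http://spam.example.com buy now http://spam.example.com buy now";
var_dump( scoreWikitext( $wikitext ) >= 3 );
</syntaxhighlight>

The actual thresholds would be decided during the testing period in the timeline below, since they depend heavily on how each wiki is used.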
Timeline:
Before 27th April: Request a Gerrit repository for the extension from the MediaWiki community; set up the basic design of the solution.
27th April to 25th May: Interact with community members and implement any changes/improvements to the proposed solution.
25th May: Official GSoC coding begins.
25th May to 4th June: Design an algorithm for detecting spam on a page.
5th June to 6th June: Code the algorithm.
7th June to 25th June: Extend the algorithm to handle the different types of spam and to delete them.
26th June to 28th June: Clean up the code and finalize small details before the test run.
29th June to 1st July: Test the extension in different browsers to check that the code runs properly in each.
1st July to 11th July: Have MediaWiki members plant different kinds of spam to check whether the code is effective against all of them.
12th July to 15th July: Make any changes to the extension that these tests show are needed.
16th July: Wrap up all parts of the extension.
17th July to 19th July: Test and remove any remaining bugs.
20th July: Improve the documentation and finalize the code.
My project will be done in three parts, as mentioned earlier.
Part 1: Identifying spam on the page
This part mainly deals with finding spam on a page. Spam can be hidden, can lead the user to an infected site, can open a number of browser windows at once, and so on.
The problem is how to detect it, because not all spam can be removed with the same technique; different kinds of spam behave differently.
Using PHP, JavaScript and Ajax I will be able to find all the possible kinds of spam.
Whenever a user requests a page and is redirected to another one, the user is facing a spam problem, and this can be avoided. Once the user clicks on a link, I will keep a variable that stores the URL of the page the user wanted. This variable will be compared with the URL of the page that is about to be shown to the user. If the two values do not match, the user was being redirected to a spam page, and the URL of the original page will be reloaded. Since the spam would otherwise keep causing problems, as soon as a site is identified as spam, redirection to it will be blocked. If even this fails, then once the page is redirected, code will automatically be attached to all the links asking "Do you want to block this site?"; if the user chooses to block the link, it will be blocked and the page will be reloaded. This only applies to links that open while the page is loading. If a number of pages are opened, the user can report it through the user spam reporting system described in part 3.
In this way the user will be protected from the spam technique called cloaking.
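A rough PHP sketch of the URL comparison and redirect stripping described above; the URLs, the stripRedirects name and the regexes are my own assumptions, and a real implementation would need to be more careful about legitimate uses of location.

<syntaxhighlight lang="php">
<?php
// Sketch: strip common client-side redirect tricks (meta refresh tags,
// location reassignments) from page HTML so the user sees the page
// they actually requested. All names and values here are placeholders.
function stripRedirects( $html ) {
    // Remove <meta http-equiv="refresh" ...> tags that jump elsewhere.
    $html = preg_replace(
        '/<meta[^>]+http-equiv\s*=\s*["\']?refresh["\']?[^>]*>/i',
        '',
        $html
    );

    // Remove inline JavaScript that reassigns the current location.
    $html = preg_replace(
        '/(window\.|document\.)?location(\.href)?\s*=\s*["\'][^"\']*["\']\s*;?/i',
        '',
        $html
    );

    return $html;
}

// Usage: $expectedUrl is the URL the user asked for, $servedUrl is the
// URL about to be shown; if they differ, serve the cleaned page instead.
$expectedUrl = 'https://example.org/wiki/Requested_Page'; // placeholder
$servedUrl   = 'https://spam.example.com/landing';        // placeholder
if ( $servedUrl !== $expectedUrl ) {
    $cleanHtml = stripRedirects( '<meta http-equiv="refresh" content="0;url=https://spam.example.com">' );
    echo $cleanHtml; // redirect code removed, original page can be reloaded
}
</syntaxhighlight>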
Another possible kind of spam is advertisements popping up each time a page is opened or whenever the user clicks on it; this can be avoided by placing a counter variable. The page source code will be checked before it is passed to the user's system, and if any advertisement keyword or pop-up code is present, that particular code will be deleted and the page reloaded. Next comes the question of how to deal with different sites opening in different tabs whenever the user clicks on the page. This can be handled by checking the source code for the keyword "onclick": if any URL is attached to an onclick handler, that URL will be deleted from the source code before the page is forwarded. Further techniques can be described to the mentor on request.
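A minimal sketch of this onclick filtering using PHP's DOMDocument; which event attributes to strip is an assumption on my part.

<syntaxhighlight lang="php">
<?php
// Sketch: remove onclick/onmouseover handlers (and the URLs they open)
// and obvious window.open() pop-up calls from the page HTML before it
// is sent to the user. Function name and attribute list are assumptions.
function stripPopupCode( $html ) {
    $doc = new DOMDocument();
    @$doc->loadHTML( $html ); // suppress warnings about messy real-world markup

    $xpath = new DOMXPath( $doc );
    foreach ( $xpath->query( '//*[@onclick or @onmouseover]' ) as $node ) {
        $node->removeAttribute( 'onclick' );
        $node->removeAttribute( 'onmouseover' );
    }

    $html = $doc->saveHTML();

    // Also strip inline window.open(...) calls left inside script blocks.
    return preg_replace( '/window\.open\s*\([^)]*\)\s*;?/i', '', $html );
}

// Example: the onclick URL is dropped, the visible link text is kept.
echo stripPopupCode( '<a href="#" onclick="window.open(\'http://spam.example.com\')">click</a>' );
</syntaxhighlight>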
All this spam will be deleted in the sense that it will be transferred into a PHP file on the server side, which will automatically be cleaned at a regular interval set by the MediaWiki members.
After deletion, the new page source will replace the page source on the server, so the page becomes spam free.
Before the spam is sent to that PHP file, that is, once spam is detected, the offending source code will be copied and kept on the server, and whenever the page is requested by a client/user, this list
of spam snippets will be compared with the page code that is about to be sent to the user.
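A sketch of that server-side store and the comparison step, assuming the detected fragments are kept in a JSON file; the path, function names and format are placeholders.

<syntaxhighlight lang="php">
<?php
// Sketch: keep detected spam fragments in a server-side store and
// filter them out of any page that is about to be served.
define( 'SPAM_STORE', '/var/lib/spamfilter/snippets.json' ); // placeholder path

function recordSpamSnippet( $snippet ) {
    $list = is_readable( SPAM_STORE )
        ? json_decode( file_get_contents( SPAM_STORE ), true )
        : array();
    $list[] = $snippet;
    file_put_contents( SPAM_STORE, json_encode( array_values( array_unique( $list ) ) ) );
}

function filterKnownSpam( $pageSource ) {
    if ( !is_readable( SPAM_STORE ) ) {
        return $pageSource;
    }
    $list = json_decode( file_get_contents( SPAM_STORE ), true );
    if ( !$list ) {
        return $pageSource;
    }
    // Remove every previously recorded spam fragment from the page.
    return str_replace( $list, '', $pageSource );
}

// The store itself would be emptied on a schedule chosen by the wiki
// admins, for example from a cron job that truncates the file.
</syntaxhighlight>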
Part 2: This part mainly deals with developing the algorithm to delete the spam from the pages, as outlined above.
First the user requests the page and the server returns it, that is, its source code. My extension will run on the server side; if my mentor requests it, I can also develop it so that it runs on both the
client and the server side. On the server side the spam is filtered first, and the spam-free page is then delivered to the user after passing through the conditions mentioned above. In this way all
detectable spam is filtered and the page becomes spam free.
For spam that is still possible but is not detected by this extension, I have kept a spam-testing period in my timeline during which I can fix errors and change the code so that it deals with as much
spam as possible, so that users do not face any problems.
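One plausible way to run this on the server side is through MediaWiki's OutputPageBeforeHTML hook, which passes the rendered HTML by reference before it is sent to the client. The sketch below only shows the wiring; the SpamCleaner class name and the helper stubs are my own assumptions, not existing code.

<syntaxhighlight lang="php">
<?php
// Registration in the extension setup file would look like:
//   $wgHooks['OutputPageBeforeHTML'][] = 'SpamCleaner::onOutputPageBeforeHTML';
class SpamCleaner {
    // Runs just before the rendered HTML is handed to the skin, so the
    // user only ever receives the cleaned version of the page.
    public static function onOutputPageBeforeHTML( $out, &$text ) {
        $text = self::stripRedirects( $text );   // cloaking / redirects (part 1)
        $text = self::stripPopupCode( $text );   // onclick / window.open pop-ups
        $text = self::filterKnownSpam( $text );  // fragments recorded earlier
        return true; // let other hook handlers run as usual
    }

    // These would contain the logic sketched earlier; stubs for now.
    private static function stripRedirects( $html ) { return $html; }
    private static function stripPopupCode( $html ) { return $html; }
    private static function filterKnownSpam( $html ) { return $html; }
}
</syntaxhighlight>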
During the entire development I would like to receive help from my mentor, Yaron Koren, and from MediaWiki members in testing and finalizing the extension.
In part 3 I will design the user spam reporting system. If the user finds any spam on a page, of whatever type, they will categorise it under one of a list of categories and submit a deletion/removal
request, which will also include a captcha to check whether the submitter is a bot or a human. The request will be sent to the admin, and if the admin finds it valid, they will take the necessary action.
This action, the request category and the location of the spam will be kept as a record, so that whenever the page is requested by a client, the server will check whether that spam is still present
(by comparing the page after the deletion of the spam with the page that the server is about to serve) or has already been deleted. If the two pages do not match during the comparison, the same
deletion routine will run again, and the page will be updated regularly using Ajax. If the user later finds new spam in the same place or anywhere else, whether of the same type or a different one,
they will again report it to the admin, because the location and type of the spam will keep changing. By type I mean whether it is a link, an image, plain text, and so on. All the details, including
the location and type of the spam, will be sent to the admin, and the same process will continue. The location of the spam will automatically be recorded once the admin finds it appropriate to
categorise the report as spam.
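To give an idea of what a report in this system might carry, here is a rough PHP sketch of the record structure and a submission handler; every name, category and path here is a placeholder, not an existing API.

<syntaxhighlight lang="php">
<?php
// Sketch: accept a user spam report (page, category, location) and
// queue it for admin review. Captcha verification is assumed to have
// happened before this function is called.
define( 'REPORT_QUEUE', '/var/lib/spamfilter/reports.json' ); // placeholder path
$reportCategories = array( 'link', 'image', 'plain-text', 'redirect', 'other' );

function submitSpamReport( $page, $category, $location ) {
    global $reportCategories;
    if ( !in_array( $category, $reportCategories, true ) ) {
        return false; // unknown category, reject the report
    }
    $reports = is_readable( REPORT_QUEUE )
        ? json_decode( file_get_contents( REPORT_QUEUE ), true )
        : array();
    $reports[] = array(
        'page'      => $page,
        'category'  => $category,
        'location'  => $location,  // e.g. a section heading or an element id
        'status'    => 'pending',  // the admin later marks it valid/invalid
        'submitted' => date( 'c' ),
    );
    return file_put_contents( REPORT_QUEUE, json_encode( $reports ) ) !== false;
}

// Example: a reader flags a spammy external link in the "References" section.
submitSpamReport( 'Main_Page', 'link', 'section:References' );
</syntaxhighlight>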
Source code will be pushed to a Gerrit repository as soon as I get one.
I'm Arindam Padhy, a second-year undergraduate student in the Computer Science branch at the International Institute of Information Technology (IIIT), Bhubaneswar, India.
I have done a few networking projects using PHP.
I have a strong interest in dealing with malicious things like malware, viruses and spam, and my main interest is network security. Last summer I took training under Hewlett-Packard officials on
Network Management and Security, after which I was certified by them.
Apart from this, I have been involved in making websites secure by dealing with all the possible security issues.
I have designed websites for my school and for college festivals.
How did you hear about the program?
I heard about GSoC from my friends.
Will you have any other time commitments, such as school work, another job, planned vacation, etc., during the duration of the program?
My second-year final examinations begin on 23rd April and last until 10th May; during that time I have to focus on my exams. After that my summer vacation begins, and it lasts three months, ending in
August. As soon as my vacation begins I will be able to give full commitment to my project, and I assure you I will follow my timeline strictly, without any deliberate delays.
We advise all candidates eligible for Google Summer of Code and the FOSS Outreach Program for Women to apply for both programs. Are you planning to apply to both programs and, if so, with what project?
No, I would only like to apply for GSoC 2015.
What does making this project happen mean to you?
It means a lot to me. Firstly, it is related to something I have always wanted to do. It will also help me gain experience and learn a lot about MediaWiki and how it works.
How would you like to contribute to MediaWiki after GSoC 2015?
Even after GSoC 2015 ends I would like to contribute to MediaWiki in every way I can be useful, especially on security issues, which I consider my strength and which are the reason I am applying for
this project.
This would be my first experience with MediaWiki, but I have some previous experience with phpMyAdmin, where I worked on a project on user interface development. I had already begun my work on
that, but unfortunately I was not selected.
Still, I finished the work and applied the patch on my own machine; basically it was work on server variables.
As I mentioned earlier, my main interests are security and malware testing. In the near future I will be trying to earn a certificate for doing a project with Cisco.
I started using MediaWiki last year and have been planning to work on this project since then.
Projects that I have worked on:
1. Security issues on Linux systems
2. The security of my college's open-source academic information system, known as Hibiscus:
https://hib.iiit-bh.ac.in/Hibiscus/Login/?client=iiit