
GSOC Project Proposal for the Idea: Extension to identify and delete spam pages
Closed, Duplicate · Public

Description

This is a GSOC project proposal for the idea: Extension to identify and delete spam pages

Profile Information

Name: Arindam Padhy

Email: arindamadri1995@gmail.com

IRC nick: d3m3nt3r

Location: India

Time Zone: UTC+5:30

Typical working hours: 9 PM to 3 AM before 23rd April; 3 PM to 6 PM after 23rd April (Indian Standard Time)

Synopsis:

There are quite a few MediaWiki extensions to prevent spam, and some extensions that let you delete pages en masse.
What MediaWiki doesn't have yet is a capability to deal well with spam that's already in place on the wiki.
The Nuke extension lets you do a mass deletion on all pages created by a single user or IP address, but that's not too helpful because spammers tend to switch quickly from one user/IP address to another,
perhaps to get around such tools.

At present the spam detection system is not very efficient, since it fails to catch certain kinds of spam. The addition of rel="nofollow" has brought a slight reduction, but there are still many types of spam to deal with. My project will try to deal with them all.

Possible Mentors:
Mentor: Yaron_Koren
Co-Mentor: jan

Solution:

My solution is to build the extension in three phases:
1. Detection of the different kinds of spam
2. Deletion of the spam
3. A user/client spam reporting system

Possible Problems:

1. Redirection to a different website before the original one loads
2. Presence of a large number of images in a particular part of the page
3. A low proportion of wikitext
4. Plain unwanted text
5. A large number of links
6. A large number of recent changes to the page source
7. Grammatical errors on the page, including misspelled words
8. The same phrases being used repeatedly
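
To make these signals concrete, here is a minimal sketch of how some of them could be combined into a single heuristic score. The function name, weights and thresholds are all illustrative assumptions, not part of any existing MediaWiki API:

```php
<?php
// Hypothetical heuristic scorer combining signals 2, 5 and 8 from the
// list above. All names, weights and thresholds are assumptions.

function computeSpamScore( $wikitext ) {
	$score = 0;

	// Signal 5: a large number of external links relative to page size.
	$linkCount = preg_match_all( '/https?:\/\/[^\s\]]+/i', $wikitext, $m );
	$wordCount = str_word_count( $wikitext );
	if ( $wordCount > 0 && $linkCount / $wordCount > 0.1 ) {
		$score += 2;
	}

	// Signal 2: many image inclusions on a single page.
	$imageCount = preg_match_all( '/\[\[(File|Image):/i', $wikitext, $m );
	if ( $imageCount > 20 ) {
		$score += 1;
	}

	// Signal 8: the same line repeated over and over.
	$lines = array_filter( array_map( 'trim', explode( "\n", $wikitext ) ) );
	if ( count( $lines ) > 5 ) {
		$repeats = count( $lines ) - count( array_unique( $lines ) );
		if ( $repeats / count( $lines ) > 0.5 ) {
			$score += 2;
		}
	}

	return $score; // e.g. treat a score of 3 or more as "likely spam"
}
```

In practice every threshold would have to be tuned per wiki; as the reviewers note in the comments below, a page full of images or links can be perfectly legitimate content.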

Timeline:

Before 27th April: Request a Gerrit repository for the extension from the MediaWiki team. Set up the basic design of the solution.

27th April to 25th May: Interact with community members; implement any changes/improvements to the proposed solution.

25th May: Official coding begins for GSOC.

25th May to 4th June: Design an algorithm for detecting spam on pages.

5th June to 6th June: Code the algorithm.

7th June to 25th June: Extend the algorithm to deal with the different types of spam and to delete them.

26th June to 28th June: Clean up the code and finalize minor details before the test run.

29th June to 1st July: Test the extension on different browsers to verify the code runs properly in each.

1st July to 11th July: MediaWiki members try different kinds of spam against the extension to check whether the code is robust enough to deal with them all.

12th July to 15th July: Make any changes to the extension required after that testing.

16th July: Wrap up all parts of the extension.

17th July to 19th July: Test and remove any further bugs found.

20th July: Improve documentation and finalize the code.

Participation:
My project will be done in three parts, as mentioned earlier.

Part 1: Identifying spam on the page

This part mainly deals with finding the spam present on a page. Spam can be hidden, can lead the user to an infected site, can open several browser windows at a time, and so on.
The problem is how to detect it, because not all spam can be removed with the same technique; different kinds of spam behave differently.
Using PHP, JavaScript and Ajax I will be able to find all the possible spam.

Whenever a user requests a web page and is redirected to another one, they are facing a spam issue. This can be avoided. Once the user clicks a link, I will keep a variable that stores the URL of the site the user wanted; this variable will be compared with the URL of the site that is about to be shown. If the two values do not match, the user was being redirected to a spam page, and the URL of the original site will be reloaded. Since the spam would otherwise keep causing problems, as soon as we discover that a site is a spam site, we will block redirection to it. If even this fails, then once the page is redirected, code will automatically be attached to all the links asking "Do you want to block this site?"; if the user wishes to block that link, it will be blocked and the website reloaded. This will only happen with links that open while the page is loading. If a number of pages are opened, the user can report it through the user spam reporting system described in Part 3.
In this way the user will be protected from the spam technique called cloaking.
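
To make the URL-comparison idea concrete on the server side, here is a rough sketch that scans wikitext for "cloaked" external links, where the visible label is itself a URL different from the real target (the pattern discussed in the comments below). The function name and regular expressions are assumptions:

```php
<?php
// Hypothetical detector for "cloaked" external links in wikitext,
// i.e. [http://real-target http://displayed-url] where the label is a
// URL that does not match the target. Names and patterns are sketches.

function findCloakedLinks( $wikitext ) {
	$cloaked = [];
	// Wikitext external-link syntax: [http://target label text]
	preg_match_all( '/\[(https?:\/\/[^\s\]]+)\s+([^\]]+)\]/i', $wikitext, $m, PREG_SET_ORDER );
	foreach ( $m as $link ) {
		$target = $link[1];
		$label  = trim( $link[2] );
		// Flag the link when its visible label is itself a URL that
		// differs from where the link really goes.
		if ( preg_match( '#^https?://\S+$#', $label ) && $label !== $target ) {
			$cloaked[] = [ 'target' => $target, 'label' => $label ];
		}
	}
	return $cloaked;
}
```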

Another possible kind of spam is advertisements popping up each time a site is opened, or whenever the user clicks on it; this can be avoided by placing a counter variable. The page source code will be checked before it is passed to the user's system, and if any advertisement keyword or pop-up code is present, that particular code will be deleted and the page reloaded. As for different sites opening in different tabs whenever the user clicks on the page, this can be dealt with by checking the source code for the keyword "onclick": if any URL is attached to an onclick handler, that URL will be deleted from the source code before the web page is forwarded. Further techniques can be described to the mentor on request.
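
A minimal sketch of the "onclick" cleanup described above, assuming the page HTML is available as a string before it is sent to the client. The function name and patterns are illustrative only:

```php
<?php
// Hypothetical cleanup pass over raw HTML before it is sent to the
// client: strips inline onclick handlers and window.open() pop-up
// calls. Function name and patterns are assumptions.

function stripPopupCode( $html ) {
	// Remove inline onclick="..." attributes entirely.
	$html = preg_replace( '/\sonclick\s*=\s*("[^"]*"|\'[^\']*\')/i', '', $html );
	// Remove script calls that open new windows or tabs.
	$html = preg_replace( '/window\.open\s*\([^)]*\)\s*;?/i', '', $html );
	return $html;
}
```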

All this spam will be "deleted" in the sense that it will be transferred into a PHP file on the server side, which will automatically be cleaned at an interval set by the MediaWiki members.
After deletion, the new page source will replace the page source on the server, so the page becomes spam free.

Before the spam is sent to that PHP file, i.e. once the spam is detected, the whole source code will be copied and kept on the server, and whenever the web page is requested by a client/user, this list of spam code snippets will be compared with the page code that is about to be sent to the user.
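
The proposal does not specify the storage format, but assuming the recorded spam snippets live in a plain text file on the server (one snippet per line), the comparison step could look roughly like this; the file path and function name are hypothetical:

```php
<?php
// Hypothetical comparison step: strip previously recorded spam snippets
// from the outgoing page text. Path, format and name are assumptions.

function removeKnownSpam( $pageText, $spamFile = '/tmp/known-spam.txt' ) {
	if ( !is_readable( $spamFile ) ) {
		return $pageText;
	}
	$snippets = file( $spamFile, FILE_IGNORE_NEW_LINES | FILE_SKIP_EMPTY_LINES );
	foreach ( $snippets as $snippet ) {
		$pageText = str_replace( $snippet, '', $pageText );
	}
	return $pageText;
}
```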

Part 2: Deleting the spam

This part mainly deals with developing the algorithm to delete the spam from the pages, as already outlined above.
First the user requests the web page, and the server transfers it, that is, its source code. My extension will run on the server side; if my mentor requests it, I can develop it so that it runs on both the client and the server side. On the server side the spam will be filtered first, and the spam-free page will then be transferred to the user after passing through the conditions mentioned above. In this way all detectable spam will be filtered and the page will be spam free.
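
One concrete way to attach such server-side filtering to MediaWiki (a sketch only, not part of the proposal's text) would be the EditFilter hook, which lets an extension abort a save by setting a non-empty error message. computeSpamScore() is the hypothetical scorer sketched earlier, and the threshold is arbitrary:

```php
<?php
// A minimal sketch using MediaWiki's EditFilter hook to reject edits
// that score too high. computeSpamScore() and the threshold of 3 come
// from the earlier hypothetical sketch, not an existing API.

$wgHooks['EditFilter'][] = function ( $editor, $text, $section, &$error, $summary ) {
	if ( computeSpamScore( $text ) >= 3 ) {
		// A non-empty $error aborts the save and is shown to the user.
		$error = '<div class="errorbox">This edit looks like spam and has been blocked.</div>';
	}
	return true;
};
```

This only catches new edits; cleaning spam already on the wiki, which the synopsis names as the real gap, would still need the comparison pass described above.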

The coding will be done entirely in PHP, JavaScript and Ajax. The deletion procedure will be made efficient and will not slow down page loading. For any further types of spam that are possible but not yet detected by this extension, I have reserved a spam-testing period in my timeline during which I can fix errors and change the code so that it deals with as many kinds of spam as possible and the user doesn't face any problems.

During the entire development I would like to receive help from my mentor Yaron Koren and the MediaWiki members in testing and finalizing my extension.

Part 3: The user spam reporting system

In Part 3 I will design the user spam reporting system. If the user finds any spam on the page, of whatever type, they will categorize it under one of a fixed list of categories and submit a deletion/removal request, which will include a CAPTCHA to check whether the requester is a bot or a human. The request is sent to the admin, and if the admin finds it valid, they take the necessary action. The action, the request category and the location of the spam are kept as a record, so that whenever a client requests the page, the server can check whether that spam is still present or has been deleted (by comparing the page after spam deletion with the page the server is about to load). If the two pages do not match during comparison, the same deletion function will run again; the page will be updated regularly using Ajax. If the user later finds new spam, in the same place or anywhere else, of the same or a different type, they will report it to the admin again, because the location and the type of the spam can always change. By "type" I mean whether it is a link, an image, plain text, etc. All the details, including the location and the type of the spam, will be sent to the admin, and the same process will continue. The location of the spam will be recorded by the admin once they find it appropriate to categorize it as spam.
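
To make the reporting system concrete, here is a rough sketch of how a report record could be saved through MediaWiki's database abstraction layer. The spam_report table and all its field names are invented for illustration; a real extension would define its own schema:

```php
<?php
// Hypothetical shape of a spam report record for the reporting system
// described above. Table and field names are assumptions.

function saveSpamReport( $dbw, $pageId, $category, $location, $reporter ) {
	// $category is one of the types described above ('link', 'image',
	// 'plain text', ...); $location says where on the page the spam sits.
	$dbw->insert( 'spam_report', [
		'sr_page'     => $pageId,
		'sr_category' => $category,
		'sr_location' => $location,
		'sr_reporter' => $reporter,
		'sr_handled'  => 0, // set to 1 once an admin acts on the report
	], __METHOD__ );
}
```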

Design:

Source Code
Source code will be pushed to a Gerrit repository as soon as I get one.

About Me

I'm Arindam Padhy, a second-year undergraduate student in the Computer Science branch at the International Institute of Information Technology (IIIT), Bhubaneswar, India.
My main interest is in web languages (PHP, JavaScript, HTML).
I have done a few networking projects using PHP.
I have a strong interest in dealing with malicious things like malware, viruses and spam, and my main interest is network security. I took training under Hewlett-Packard officials last summer on Network Management and Security, after which I was certified by them.
Apart from this, I have been involved in making websites secure by dealing with all possible security issues.
I have designed websites for my school and for college festivals.

How did you hear about the program?
I heard about GSOC from my friends.

Will you have any other time commitments, such as school work, another job, planned vacation, etc., during the duration of the program?

My 2nd-year final examinations begin on 23rd April and last until 10th May; during that time I have to be focused on my exams. After that, my summer vacation begins; it is three months long, ending in August. As soon as my summer vacation begins I will be able to give full commitment to my project, and I assure you I will follow my timeline strictly, without any deliberate delays.

We advise all candidates eligible to Google Summer of Code and FOSS Outreach Program for Women to apply for both programs. Are you planning to apply to both programs and, if so, with what
organization(s)?

No, I would only like to apply for GSOC 2015.

What does making this project happen mean to you?

It means a lot to me. Firstly, it is related to something I have always wanted to do. It will also lead me to gain experience and learn a great deal about MediaWiki and how it works.

How would you like to contribute to MediaWiki after GSOC 2015?

Even after GSOC 2015 ends, I would like to contribute to MediaWiki in every way I can be useful, especially on security issues, which I consider my strength and the reason I am applying for this project.

Past Experience

This would be my first experience with MediaWiki, but I have some previous experience with phpMyAdmin, where I attempted a project on user interface development. I had already begun my work on it, but unfortunately I was not selected.

Still, I finished my work and implemented the patch on my machine; basically, it was work on server variables.

As I have already mentioned, my main interest is security and malware testing. In the near future I will try to earn a certificate by doing a project with Cisco.

I have been using MediaWiki since last year and have been planning to work on this project since then.

Projects that I have worked on:
1. Security issues on Linux systems
2. Security of the open source academic information system of my college, known as Hibiscus [1]
3. Malware testing

[1] https://hib.iiit-bh.ac.in/Hibiscus/Login/?client=iiit

Event Timeline

lucky created this task. · Mar 21 2015, 6:49 PM
lucky claimed this task and raised its priority to Normal.
lucky added subscribers: MZMcBride, rana, phoenix_du_42 and 29 others.

@lucky you should spend some time understanding the project idea more. Spam in the context of wiki pages does not necessarily mean spam links and pop-up ads. It can even be plain unwanted text. How do you plan to detect and correct that?
What happens if your algorithm incorrectly identifies a page as spam?

lucky added a comment. (Edited) · Mar 22 2015, 3:22 PM

Actually, ma'am, those are some of the types of spam possible which I have mentioned. I have already stated that even if there are other types of spam able to pass through my spam filter, I will develop methods for dealing with them too.
Coming back to your question on how to deal with plain unwanted text: I will design the filter so that whenever it finds a large collection of non-wiki text, or a large number of misspelled words, or possibly a lot of new images being put up by an anonymous IP, all of these will be dealt with.
I will also make the extension keep track of the edit history to check whether a lot of changes have been made in the code.
Unrelated commands, if used, will also be checked.
Users will also be given an option to report spam, i.e. when unwanted text is present.
The presence of lots of images, links and hyperlinks will also be considered spam.


(I'm an outsider and not involved so I'm just adding some naive comments here.)

@lucky: Have you set up a MediaWiki instance and played a bit with it, plus with wikitext markup (and potentially parsing it)? I highly recommend that to gain a deeper understanding. :)

> whenever it finds a large collection of non-wiki text, or a large number of misspelled words, or possibly a lot of new images being put up by an anonymous IP, all of these will be dealt with.

I'm afraid these criteria need some more investigation, e.g. how you want to identify "non wiki text" (and why you think that is a criterion to identify spam), why you think that misspelling is a criterion for spam, and why uploading lots of images is (in many wikis it might not be a criterion at all).

> I will also make the extension keep track of the edit history to check whether a lot of changes have been made in the code.

In which "code"? Did you mean wiki content instead?

> Unrelated commands, if used, will also be checked.

What are examples of "unrelated commands" in the context of wiki page content?

> The presence of lots of images, links and hyperlinks will also be considered spam.

Again there might be wikis where a lot of such elements are present and totally valid content.

> Whenever a user requests a web page and is redirected to another one, they are facing a spam issue.

How much is this actually in the scope of MediaWiki itself? Wiki pages include links to external URLs. Any external URL can redirect to another external URL and that's nothing that you could ever check in MediaWiki itself because you've already left the MediaWiki instance in that browser window I think. But maybe I don't get what you refer to?

And generally speaking I guess you are already aware of existing functionality such as
https://www.mediawiki.org/wiki/Help:Sysop_deleting_and_undeleting
https://www.mediawiki.org/wiki/Manual:RevisionDelete

lucky added a comment. · Mar 26 2015, 3:30 PM

First of all, regarding the links: every website has a particular URL, and if that URL does not match the URL to which the user is being directed, then that will obviously be considered spam.

lucky added a comment. · Mar 26 2015, 3:33 PM

"A lot of images" means images compressed and stored as a zip file, so that whenever the user opens it a number of images are opened; I want to avoid this. The same goes for the links: a chain of links, one redirecting the user to another, needs to be avoided.

lucky added a comment. · Mar 26 2015, 3:34 PM

I mean the page source code, if any major changes are made in it.

lucky added a comment. (Edited) · Mar 26 2015, 3:35 PM

Regarding uploading a lot of images: I am not deleting them. They will first be checked, and then the action will be decided.
Regarding the non-wiki text, these things are already mentioned in the problem statement.

lucky added a comment. · Mar 26 2015, 3:38 PM

Thank you so much for your comments.

Qgil added a comment. · Mar 26 2015, 4:00 PM

@lucky, every time you create a comment, a notification is sent to a bunch of people watching this task. Instead of five comments in eight minutes, you can use the eight minutes to write one comment. If what you want is to reply to specific parts of other comments, you can click the dropdown menu in the comment you want to reply to, and then "Quote".

https://www.mediawiki.org/wiki/Phabricator/Help#Writing_comments_and_descriptions

lucky added a comment. · Mar 26 2015, 6:13 PM

I will keep that in mind next time, sir.
Sorry for the inconvenience caused.

@lucky, what microtask(s) have you completed, or are you currently working on any?
I'd like to see a Gerrit link, please. Thank you!

lucky added a comment. · Mar 27 2015, 6:24 PM

I was working on microtask T91092, but it was closed after a bug was reported.
At present I have solved no microtask.

It's unfortunately unclear to me which specific aspects the above comments refer to as no quoting was used, so it's hard to follow. :(

> I mean the page source code, if any major changes are made in it.

Editing page content cannot alter the source code of the MediaWiki application. Maybe your understanding of "source code" is different, somehow?

> I was working on microtask T91092, but it was closed after a bug was reported.

That task is still open but as it turned out to be more complicated the patch in Gerrit was abandoned. I'm not sure what was "closed" exactly. :)

> At present I have solved no microtask.

https://www.mediawiki.org/wiki/Annoying_little_bugs might also be helpful to find one.

lucky added a comment. · Mar 28 2015, 7:00 PM

@jan
Sir, any comments on my project?

Note that some comments were given already above, plus so far a microtask still seems to be missing?

lucky added a comment. · Mar 29 2015, 7:28 AM

> How much is this actually in the scope of MediaWiki itself? Wiki pages include links to external URLs. Any external URL can redirect to another external URL and that's nothing that you could ever check in MediaWiki itself because you've already left the MediaWiki instance in that browser window I think. But maybe I don't get what you refer to?

> And generally speaking I guess you are already aware of existing functionality such as
> https://www.mediawiki.org/wiki/Help:Sysop_deleting_and_undeleting
> https://www.mediawiki.org/wiki/Manual:RevisionDelete

> First of all, regarding the links: every website has a particular URL, and if that URL does not match the URL to which the user is being directed, then that will obviously be considered spam.

> I'm afraid these criteria need some more investigation, e.g. how you want to identify "non wiki text" (and why you think that is a criterion to identify spam), why you think that misspelling is a criterion for spam, and why uploading lots of images is (in many wikis it might not be a criterion at all).

> Regarding the non-wiki text, these things are already mentioned in the problem statement.

> The presence of lots of images, links and hyperlinks will also be considered spam.

> "A lot of images" means images compressed and stored as a zip file, so that whenever the user opens it a number of images are opened; I want to avoid this. The same goes for the links: a chain of links, one redirecting the user to another, needs to be avoided.

> whenever it finds a large collection of non-wiki text, or a large number of misspelled words, or possibly a lot of new images being put up by an anonymous IP, all of these will be dealt with.

> Regarding uploading a lot of images: I am not deleting them. They will first be checked, and then the action will be decided.

> Editing page content cannot alter the source code of the MediaWiki application. Maybe your understanding of "source code" is different, somehow?

Sir, it will check the key files and directories in the MediaWiki source code.


> First of all, regarding the links: every website has a particular URL, and if that URL does not match the URL to which the user is being directed, then that will obviously be considered spam.

Could you create a specific, non-abstract wikitext example of such a situation that you are describing and want to "consider as spam", on a subpage of your (empty) userpage?

lucky added a comment. · Mar 29 2015, 9:16 PM

@Aklapper
What I am actually trying to say is that whenever the user clicks on a link, it should take them to a page; the problem is when they get redirected to some other page rather than the original one.
Suppose the user wants to go to the Phabricator site, so he clicks the link on the MediaWiki page, but he is redirected to some other site which is not the Phabricator site; then he is facing spam.

Let me repeat my Yes/No question: Could you create a specific, non-abstract wikitext example of such a situation that you are describing and want to "consider as spam", on a subpage of your (empty) userpage?

We'd like to see it then, please. Thank you.

lucky added a comment. · Mar 31 2015, 6:36 AM

I cannot create a spam but I can design the way to deal with it.

@lucky, how do you plan to *design* the way to deal with it? What we are really looking for is some indication that you have the programming capabilities this project needs. Since you haven't completed any microtask, we'd like to see an example of a spam page (as you understand it) under your user page (https://www.mediawiki.org/wiki/User:Dementer_lucky).

lucky added a comment. · Mar 31 2015, 6:46 AM

It would be difficult for me to create a spam page because I haven't tried my hand at page design before.

lucky added a comment. · Mar 31 2015, 7:16 AM

But still, I will try to make such a spam page.
And I will soon be finishing one of the bugs I am trying to solve.

> I cannot create a spam but I can design the way to deal with it.

If you cannot create a testcase I wonder how you want to test your code. ;)
You are not supposed to "design" a page - as this task is about the MediaWiki software, you could create an example "spam" wiki page yourself on https://test2.wikipedia.org

lucky added a comment. · Apr 1 2015, 7:12 PM

@Aklapper
What do you want? Should I show you the spam condition, or something else?

@lucky, you're talking of pages with spam content. We want you to show us a page that has some spam content. Can you do that?
It seems you are unfamiliar with the process of creating wiki pages and populating them.
https://www.mediawiki.org/wiki/Help:Starting_a_new_page should help. Don't forget to create it under your user page. If unsure, ask on IRC.

lucky added a comment. · Apr 1 2015, 7:27 PM

@Aklapper

https://www.mediawiki.org/wiki/User:Dementer_lucky

This is just a demo of what I am considering as spam.
Whenever the user clicks a particular page's link, he expects the browser to open that page; when that does not happen, it is considered spam. Here I have done it with the help of a MediaWiki feature, but a person actually trying to spam us might use something else.
Thank you.

That "example" is just text which looks like a URL and being a link pointing to a URL that's not the text, like <a href="http://foo">http://bar</a>. I think that is not a common pattern in wiki content and rather unrelated/unimportant to "identifying and deleting spam pages" (my personal opinion only).

(Plus I asked you above to create your example on test2.wikipedia.org or on a subpage of your userpage, but instead you overwrote your userpage. I have no idea why that <code> markup is in your example page, please explain why you put it there.)

lucky added a comment. · Apr 2 2015, 9:12 AM

Actually, I am not getting what you want.

lucky added a comment. · Apr 2 2015, 9:23 AM

@Aklapper
I am not getting an option to edit https://test2.wikipedia.org/wiki/Main_Page

@lucky, you don't have to add the text in the main page of the wiki. Create a sub page. I gave you a link above. Read it please.

And in general, try to spend some time on Wikipedia and understand the content writing style better. You'll be able to approach this project better then.

@lucky, the page was supposed to be a subpage of your user page, something like https://www.mediawiki.org/wiki/User:Dementer_lucky/test

Anyways, I added another example to your page which is what Andre was pointing out. How would you test if something like that is spam?

lucky added a comment. · Apr 2 2015, 6:46 PM

@niharikaKohli
Will you explain in which context you consider a page a spam page, so that it would be easier for me to give an exact solution?


T90238 has some information in its description. So far my impression is unfortunately that you don't have a good idea what this task is about in the specific context of MediaWiki.

lucky added a comment. · Apr 3 2015, 12:43 PM

@Aklapper
Sir, I did as you asked. Regarding the project idea, I have already gone through it, and I designed my project according to it.

lucky added a comment. · Apr 3 2015, 12:53 PM

@Aklapper
Sir, if you could point out some faults, then I would be able to explain them to you properly or make changes if required.

Feedback has been given in this task, so there is nothing to add from me, but I have not seen any updates to the proposal in the description of this task since that feedback was given. In general, I cannot perform the further investigation (e.g. how spam on wiki pages is different from spam in emails) for you. :)

lucky added a comment. · Apr 3 2015, 4:21 PM

Sir, regarding that previous query you made: I answered it on my subpage.
Was it not sufficient?

It feels like we are running in circles a bit. I already answered that in T93480#1173193. I referred to feedback given in this task, not one single bit of feedback given in this task. Anyway, I'm out of here, I think. :)

Aklapper removed a subscriber: Aklapper. · Apr 3 2015, 4:47 PM
lucky added a subscriber: Aklapper. · Apr 3 2015, 4:53 PM

@Aklapper
Sir, I know you had mentioned that before, but I still think this condition is necessary for the project.

Ricordisamoa added a comment. (Edited) · Apr 3 2015, 4:57 PM

There seems to be a communication problem here.
From what I can see, the candidate shows dedication but some misunderstandings appear to have arisen about MediaWiki and wiki systems in general.
I'd suggest that the candidate gain some more knowledge about the platform and tweak the proposal accordingly.

Aklapper removed a subscriber: Aklapper. · Apr 4 2015, 9:17 PM