
Proposal: Automatically detect spambot registration using machine learning (like invisible reCAPTCHA)
Closed, Declined · Public


Profile Information

Kamakshi Suri
IRC nickname on Freenode
Blog URL
Github URL
Linkedin URL
Resume (optional)
Location (country or state)
City: New Delhi
State: Delhi
Country: India
Typical working hours (include your timezone)
Working Hours: 11:00 AM-8:00 PM and 11:00 PM to 4:00 AM
Timezone: UTC +5:30


Short summary describing your project and how it will benefit Wikimedia projects
The existing CAPTCHA mechanism used by Wikimedia creates a lot of friction for human users yet easily lets spambots through, which leads to a bad user experience. The main objective of this project is to keep bots out of Wikimedia and give human users a hassle-free experience when they sign up. Machine learning will be used to achieve this objective, by building a model that classifies a user based on cursor movements and the user's actions on the signup page. If the system classifies a user as a bot, the user will have to pass a CAPTCHA test to confirm whether they are a human or a bot. This will act as a double layer of protection for Wikimedia.
Once this task is accomplished, the number of fake registrations should drop to a great extent and Wikimedia's data will not be under continuous threat of being leaked. It will be a game changer for Wikimedia and a win over bots. It will also allow Wikimedia to work securely on new innovations without worrying about data loss.
Possible Mentor(s)
@Tgr , @awight
Have you contacted your mentors already?
Yes, I have been in touch with them since the application period started.


Describe the timeline of your work with deadlines and milestones, broken down week by week. Make sure to include time you are planning to allocate for investigation, coding, deploying, testing and documentation

1) Gathering data (Nov 9, 2017- Dec 5, 2017)

  • Existing Wikimedia logs will be used to collect data according to our requirements.
  • The following metrics will be recorded alongside and checked for each user:
    1. Action creation click-through rate: number of users clicking on the submit button per landing page impression.
    2. Account creation conversions: number of accounts successfully created for a given treatment per landing page impression.
    3. User block: proportion of new accounts that go on to be blocked.
  • The greater the variety, density and volume of relevant data, the better the model can learn.
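
The three metrics above are simple ratios, so they can be sketched in a few lines of Python. This is only an illustration: the `TreatmentCounts` fields, the `funnel_metrics` helper and the sample counts are assumptions made up for this sketch and do not come from Wikimedia logs.

```python
from dataclasses import dataclass


@dataclass
class TreatmentCounts:
    """Raw counts for one signup-page treatment (hypothetical field names)."""
    impressions: int       # landing page impressions
    submit_clicks: int     # users who clicked the submit button
    accounts_created: int  # accounts successfully created
    accounts_blocked: int  # new accounts that were later blocked


def safe_ratio(numerator: int, denominator: int) -> float:
    """Return numerator / denominator, or 0.0 when the denominator is zero."""
    return numerator / denominator if denominator else 0.0


def funnel_metrics(counts: TreatmentCounts) -> dict:
    """Compute the three per-treatment metrics listed in the proposal."""
    return {
        "click_through_rate": safe_ratio(counts.submit_clicks, counts.impressions),
        "conversion_rate": safe_ratio(counts.accounts_created, counts.impressions),
        "block_rate": safe_ratio(counts.accounts_blocked, counts.accounts_created),
    }


# Made-up counts, for illustration only.
sample = TreatmentCounts(impressions=1000, submit_clicks=250,
                         accounts_created=200, accounts_blocked=30)
print(funnel_metrics(sample))
```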

2) Preparing the data (Dec 5, 2017- Dec 11, 2017)

  • The quality of the data will be assessed and steps will be taken to fix issues such as missing data and outliers.
  • Exploratory analysis will be carried out to study the nuances of the data in detail and get more value out of it.
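
As a rough sketch of what fixing missing data and treating outliers could look like, the snippet below fills gaps with the median and clips extreme values using Tukey's fences. Both techniques are assumptions chosen for illustration (the proposal does not name a specific method), and the helper names and sample values are hypothetical.

```python
import statistics


def impute_missing(values):
    """Replace None entries with the median of the observed values."""
    observed = [v for v in values if v is not None]
    if not observed:
        raise ValueError("no observed values to impute from")
    median = statistics.median(observed)
    return [median if v is None else v for v in values]


def clip_outliers(values, k=1.5):
    """Clip values to [Q1 - k*IQR, Q3 + k*IQR] (Tukey's fences)."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [min(max(v, low), high) for v in values]


# Made-up click gaps in seconds; None marks a missing measurement.
raw_gaps = [1.0, 2.0, None, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 1000.0]
cleaned = clip_outliers(impute_missing(raw_gaps))
print(cleaned)
```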

3) Identify set of input features and target attributes (Dec 12, 2017- Dec 18, 2017)

  • Suggested Features
    1. Click Frequency: calculates number of clicks per second.
    2. Click Positions: tells about the position where user has clicked on Signup page.
    3. Scrolls: speed of scroll on Signup page.
    4. Keystroke frequency: no. of keyups per second.
    5. Cursor trajectory: the path which is followed by cursor on signup page.
    6. Elapsed time between various clicks: used to specify whether the time difference between consecutive clicks is in human range.
    7. Mouse speed: check whether the mouse speed is in human range
    8. Analyzing hover and focus events: analyze on which div element or on what area of page the cursor hovered or focused for what time.
    9. Action frequency: number of movements (clicks, keyups, scrolls, etc.) done per second.
    10. Screen Percentage covered: percentage of screen on which cursor moved.
    11. Most traveled direction: direction in which maximum movements occurred.
  • The above features have been framed keeping the following points in mind:
    1. Human mouse movements have non-uniform speeds, reaction times and impressions (clicks on different coordinates of buttons).
    2. An automated click might appear from nowhere and cause a flurry of unlikely hover and focus events, etc.
    3. Clicking some pattern over and over, or simultaneous keyboard and mouse usage.
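
A few of the suggested features can be derived from a time-ordered list of mouse events, roughly as sketched below. The event format (dicts with `t`, `x`, `y` and `type` keys) and the `extract_features` helper are assumptions for illustration; the data actually logged from the signup page may look different.

```python
import math
from collections import Counter


def extract_features(events):
    """Derive simple features from events sorted by time.

    Each event is a dict with 't' (seconds), 'x', 'y' (pixels) and
    'type' ('move' or 'click'). This format is hypothetical.
    """
    if len(events) < 2:
        raise ValueError("need at least two events")
    duration = events[-1]["t"] - events[0]["t"]
    clicks = [e for e in events if e["type"] == "click"]
    click_gaps = [b["t"] - a["t"] for a, b in zip(clicks, clicks[1:])]

    path_length = 0.0
    directions = Counter()
    for a, b in zip(events, events[1:]):
        dx, dy = b["x"] - a["x"], b["y"] - a["y"]
        path_length += math.hypot(dx, dy)
        if dx or dy:
            # Classify by dominant axis; screen y grows downward.
            if abs(dx) >= abs(dy):
                directions["right" if dx > 0 else "left"] += 1
            else:
                directions["down" if dy > 0 else "up"] += 1

    return {
        "click_frequency": len(clicks) / duration if duration > 0 else 0.0,
        "mean_click_gap": sum(click_gaps) / len(click_gaps) if click_gaps else 0.0,
        "mean_speed": path_length / duration if duration > 0 else 0.0,
        "most_traveled_direction": (
            directions.most_common(1)[0][0] if directions else None
        ),
    }
```

Whether a value such as `mean_speed` falls "in human range" would then be something for the trained classifier to learn from labelled data rather than a fixed rule.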

4) Training the model and testing various techniques and algorithms (Dec 19, 2017- Jan 1, 2018)

  • Choose an appropriate algorithm and a representation of the data for the model.
  • The data is split into two parts, train and test (the proportion depending on the prerequisites). The first part (training data) is used for developing the model; the second part (test data) is used as a reference.
  • Various classifiers, such as a neural network, an SVM classifier and decision trees, will be tested to find the most suitable one. These are expected to be a good fit for this problem.
  • Other classification algorithms, such as Bayesian classifiers and logistic regression, can also be tried to find the most effective one.
  • The best one should be selected by cross-validation.
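
The split and cross-validation steps above can be sketched in plain Python as follows. To keep the example self-contained, a nearest-centroid classifier is used as a stand-in; it is not one of the classifiers named in the proposal, and the function names and synthetic rows are made up for illustration.

```python
import random


def train_test_split(rows, test_fraction=0.25, seed=0):
    """Shuffle rows and return (train, test)."""
    rows = list(rows)
    random.Random(seed).shuffle(rows)
    n_test = max(1, int(len(rows) * test_fraction))
    return rows[n_test:], rows[:n_test]


def fit_centroids(rows):
    """rows: list of (features, label). Returns {label: mean feature vector}."""
    sums, counts = {}, {}
    for features, label in rows:
        acc = sums.setdefault(label, [0.0] * len(features))
        for i, value in enumerate(features):
            acc[i] += value
        counts[label] = counts.get(label, 0) + 1
    return {label: [v / counts[label] for v in acc] for label, acc in sums.items()}


def predict(centroids, features):
    """Return the label whose centroid is closest to the feature vector."""
    def sq_dist(centroid):
        return sum((a - b) ** 2 for a, b in zip(features, centroid))
    return min(centroids, key=lambda label: sq_dist(centroids[label]))


def cross_val_accuracy(rows, k=5, seed=0):
    """Mean accuracy over k folds; each fold is held out exactly once."""
    rows = list(rows)
    if len(rows) < k:
        raise ValueError("need at least k rows")
    random.Random(seed).shuffle(rows)
    folds = [rows[i::k] for i in range(k)]
    scores = []
    for i, held_out in enumerate(folds):
        train = [r for j, fold in enumerate(folds) if j != i for r in fold]
        model = fit_centroids(train)
        correct = sum(predict(model, f) == y for f, y in held_out)
        scores.append(correct / len(held_out))
    return sum(scores) / k


# Synthetic, well-separated rows: (click_frequency,) with a label.
rows = [((0.5 + 0.01 * i,), "human") for i in range(10)]
rows += [((10.0 + 0.01 * i,), "bot") for i in range(10)]
train, test = train_test_split(rows)
print(len(train), len(test), cross_val_accuracy(rows))
```

The same `cross_val_accuracy` score could be computed for each candidate classifier, and the one with the best score kept.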

5) Testing and Evaluation (Jan 14, 2018- Jan 25, 2018)

  • The second part of the data (holdout / test data) is used for testing. Testing shows how well the chosen algorithm performs, based on its outcome.
  • A better test of a model's accuracy is its performance on data that was not used at all while building the model.
  • A Python script will be written to evaluate the models on the basis of their graphical representation.
  • An evaluation metric should be chosen to compare results. Measures we can use include speed, accuracy, error rate, F1 score, precision, performance, etc.
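
Several of the measures listed above follow directly from the confusion matrix. A minimal sketch, assuming string labels with "bot" as the positive class (the `evaluate` helper and the labels are hypothetical):

```python
def evaluate(y_true, y_pred, positive="bot"):
    """Compute accuracy, error rate, precision, recall and F1 for one class."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("y_true and y_pred must be non-empty and equal length")
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    tn = sum(t != positive and p != positive for t, p in pairs)

    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    accuracy = (tp + tn) / len(pairs)
    return {
        "accuracy": accuracy,
        "error_rate": 1.0 - accuracy,
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }
```

For this project, precision and recall on the "bot" class are probably more informative than accuracy alone, since humans wrongly flagged as bots and bots that slip through have different costs.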

6) Improving the performance (Jan 26, 2018- Feb 1, 2018)

  • Introduce more variables to augment the efficiency.
  • Existing features can also be improved to yield optimal results and reduce overfitting.
  • Code will be improved to make it more generalized according to project requirements.

7) Integration and deployment of code, bug fixes and cleanup (Feb 7, 2018- Feb 12, 2018)

  • Integrate the code on the signup page and deploy the final version to check its performance in a real-time scenario.
  • Fix bugs that occur after deployment.
  • Clean up the code and finalize it according to Wikimedia's standards.

8) Integrate the Captcha system which will be used for bots as additional security layer (Feb 13, 2018- Feb 25, 2018)

  • Deploy FancyCaptcha system as an extra protection layer.
  • Build an Audio Captcha mechanism for blind people.
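
The "extra protection layer" amounts to a simple gate: only users the classifier flags as likely bots are sent to the captcha. A minimal sketch, assuming the classifier outputs a bot probability between 0 and 1; the `signup_decision` name and the 0.5 threshold are hypothetical and would need tuning against real data.

```python
def signup_decision(bot_probability: float, threshold: float = 0.5) -> str:
    """Return 'show_captcha' for likely bots and 'allow' for likely humans."""
    if not 0.0 <= bot_probability <= 1.0:
        raise ValueError("bot_probability must be between 0 and 1")
    return "show_captcha" if bot_probability >= threshold else "allow"


print(signup_decision(0.9))
print(signup_decision(0.1))
```

A lower threshold sends more users to the captcha (fewer bots get through, more humans are inconvenienced); a higher one does the opposite.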

9) Completing documentation and final code cleanup and bug fixes (Feb 26, 2018- March 5, 2018)

  • Complete documentation.
  • Fix all errors and freeze the code.

Future work:-

  1. Adding localization to FancyCaptcha and AudioCaptcha for users who are not well-versed in English.
  2. Improving the classifier by upgrading it to a higher-level classifier such as a convolutional network.

| Tasks to be completed | Timeline |
| --- | --- |
| Community Bonding Period: get a better understanding of the project, explore existing tools, get familiar with Wikimedia tools, finalize the design document with all the specifications to ease the project workflow, and gather data. | Nov 9, 2017 - Dec 4, 2017 |
| Prepare gathered data according to project requirements. | Dec 5, 2017 - Dec 11, 2017 |
| Identify various features and devise a set of metrics to prepare a feature set. | Dec 12, 2017 - Dec 18, 2017 |
| Train the model and apply various classifiers. | Dec 19, 2017 - Jan 1, 2018 |
| Write documentation for the features and components built so far and fix encountered bugs (first phase evaluation). | Jan 2, 2018 - Jan 6, 2018 |
| Check various other classifiers to find the best fit. | Jan 7, 2018 - Jan 13, 2018 |
| Comparative study of various classifiers and evaluation. | Jan 14, 2018 - Jan 25, 2018 |
| Improve performance of the model. | Jan 26, 2018 - Feb 1, 2018 |
| Finalize the classifier based on evaluation results. Update documentation with all additions and updates (second phase evaluation). | Feb 2, 2018 - Feb 6, 2018 |
| Integrate and deploy code on the signup page and fix bugs. | Feb 7, 2018 - Feb 12, 2018 |
| Integrate FancyCaptcha for bots and code audio captcha support for blind users. | Feb 13, 2018 - Feb 25, 2018 |
| Complete documentation, freeze code and submit final report (final evaluation). | Feb 26, 2018 - March 5, 2018 |


Describe how you plan to communicate progress and ask for help, where you plan to publish your source code, etc
MediaWiki uses Phabricator, a set of powerful tools for managing bugs and tasks, so I also intend to use it for tracking bugs and features. It is also helpful for getting feedback from people who are part of the organisation. All bugs and features will have tasks linked to the project, which will allow easy tracking and monitoring.
The code base will also use Git for reviewing and managing the workflow. I'm comfortable with using Git for project development and management.
I can be contacted by email or on IRC. I also intend to use my new blog to share my experience while working on the project, and I will try to update it weekly with new posts.
I believe IRC and mailing lists are great places to seek help. It would also be great to get in touch directly with my mentors via email if possible.
If I get selected, Outreachy will be my main focus during my semester break.

About Me

Tell us about a few:

  • Your education (completed or in progress)

Pursuing B.E [2014-2018(expected)]
Major: Computer Engineering
University: Netaji Subhas Institute Of Technology

  • How did you hear about this program?

One of my friends was part of Outreachy (Round 13); since then I have been eager to be a part of this internship programme.

  • Will you have any other time commitments, such as school work, another job, planned vacation, etc, during the duration of the program?

I will have a semester break from the first week of December until the third week of January (as per the previous year's calendar), which adds up to 7 consecutive weeks. I don't have any exams between December 5, 2017 and March 5, 2018.
Previous year academic calendar

  • What does making this project happen mean to you?

Wikimedia is a place where people collaborate to empower free educational content, which inspires me to contribute to the best of my ability. Contributing to such a mission motivated me to take up this project. The project is related to machine learning, a field I relate to a lot, as I have done a few projects in it in the past. The project is a challenge in itself because it is about building a model that will defeat spambots, which are so smart that their accuracy is 99.80%. I am a person who loves taking up challenges, and this project is the true definition of a challenge. It has already taught me many new things, and being an avid learner I am one hundred percent sure that it will take my knowledge to another level. Completing this project will improve Wikimedia's security system, which will give me the satisfaction of having given back to the source from which I gain so much knowledge.

Past Experience

Describe any relevant projects that you've worked on previously and what knowledge you gained from working on them.

Online-Treasure-Hunt: I developed a platform for online treasure hunts. It is a ready-to-use portal for hosting an online treasure hunt. I built it for a competition hosted by the technical society of my university. It is coded in native PHP and is protected against various security flaws. From this project I learnt how to deal with problems encountered in live, running code. I also gained knowledge of various security issues and improved my coding practice to make my code efficient and flexible.

Survive: Web portal for a disaster management Android application. It is designed for volunteers, who can register and receive messages. I also designed an API through which information can be exchanged between the web portal and the Android application. It is built in PHP. It was the project of my first hackathon; through it I learnt how to make a final product in a stipulated time and realized the importance of modular and efficient coding.

Face-Recognition Tool: This is a simple Python script which detects faces in an image. This project was a good learning experience for me. I mainly learnt about OpenCV and Python through this project.

Hospital-Hacks: An interface made to reduce the gap between hospitals and patients. It aims to automate the thought process that goes into looking for a hospital near one's house that fits one's needs. It gives the user access to the live situation at the hospital. The doctor and reception interfaces are built in PHP, and for users there is an Android application.

Moksha-Website: This is a portal for the inter-college festival of my university. I coded a few APIs for event registration and contestant registration for this website. Through this project I learnt to code modularly and gained experience of working with a huge team.

Content-Management-System: I have also designed and coded a content management system for the computer society of my university. It was developed to streamline content control for the website and social media platforms. This project helped me learn how to integrate various APIs into my code efficiently. Through it I learnt to deal with the permission issues involved in accessing external sources.
Describe any open source projects you have contributed to as a user and contributor (include links). If you have already written a feature or bugfix for a Wikimedia technology such as MediaWiki, link to it here; we will give strong preference to candidates who have done so
I have contributed to the Systers organization. Systers is a place where people collaborate to empower women in technology. It is one of the organizations which not only support women but also motivate them to work hard.
Link to my contributions:-
Contribution 1
Contribution 2
Contribution 3

I have also contributed to Wikimedia common:-
Link to my contributions:-
Contribution 1
Contribution 2

Any Other Info

Add any other relevant information such as UI mockups, references to related projects, a link to your proof of concept code, etc
Experience with Wikimedia:-
My experience with Wikimedia so far has been a great learning journey in which I have been given all the support a newbie needs. Wikimedia is a place where people collaborate to empower free educational content, which inspires me to contribute to the best of my ability. Coding practices at Wikimedia are well designed and well defined, resulting in robust, clean and easily understandable code.

Experience with Project:-
This project taught me about reCaptcha, Invisible reCaptcha, WikimediaEvents, EventLogging, EventLogging Schema, ResourceLoader, Mouse Tracking, and Gerrit.
I have also read a lot about bots, their functioning and their behaviour. Moreover, I have studied various AI techniques to detect bots and to prevent them from entering our system. I have a keen interest in Artificial Intelligence, which is why this project attracted me.
I have worked on the following microtasks:-
Microtask 1 (T177034): Collect captcha data from signup page

Microtask 3 (T177033): Analyze sample mouse movement data and extract feature vectors

Workflow Design:-

Event Timeline

Kamsuri5 created this task. Oct 23 2017, 4:01 PM
Restricted Application added a subscriber: Aklapper. Oct 23 2017, 4:01 PM
Kamsuri5 updated the task description. Oct 23 2017, 4:15 PM
Kamsuri5 updated the task description.
Kamsuri5 updated the task description. Oct 23 2017, 4:17 PM
Kamsuri5 updated the task description. Oct 23 2017, 4:34 PM
Kamsuri5 updated the task description. Oct 23 2017, 4:40 PM

@Tgr can you please review this and give any suggestions?

srishakatux renamed this task from Automatically detect spambot registration using machine learning (like invisible reCAPTCHA) to Proposal: Automatically detect spambot registration using machine learning (like invisible reCAPTCHA). Oct 23 2017, 7:09 PM
Kamsuri5 updated the task description. Oct 23 2017, 7:54 PM
Tgr closed this task as Declined. Nov 11 2017, 8:10 PM

Thanks for participating in Outreachy! Unfortunately this proposal wasn't selected.