
Automatically detect spambot registration using machine learning like invisible reCAPTCHA (Vinitha V S)
Closed, Resolved (Public)

Description

Profile Information

Name: Vinitha V S
IRC nickname on Freenode: Groovier
Web Profile: GitHub
Blog: Vinitha
Location (country or state): Hyderabad, India
Typical working hours (include your timezone): 10am to 7pm Indian Standard Time (UTC+5:30)
(Timing is flexible and can be adjusted to suit the mentor)

Synopsis

The goal is to create a captcha that is friendlier to humans and harder for bots to crack than many existing captcha systems. Wikimedia needs captchas to prevent spambots from registering, since spambots can put unnecessary load on the system or even launch attacks on it. Current captcha systems, however, can be cracked by simple bots and often cause undue inconvenience to humans.

The latest captcha systems that meet these requirements usually follow the invisible captcha strategy: they distinguish humans from spambots by tracking user actions instead of issuing an explicit challenge. An explicit challenge is issued only when the user's behavior is suspicious. Many off-the-shelf solutions rely on uniquely identifiable user data, which raises privacy concerns. Hence the challenge is to build an invisible captcha system that uses only anonymous data. The main objective is to identify behavior traits that distinguish humans from bots and to create a Machine Learning model that uses these traits as features. The major tasks are identifying the optimal set of features, identifying the best Machine Learning algorithm, and generating relevant statistics to analyse the effectiveness of the captchas, all while keeping the resource budget in mind.

The existing Wikimedia captcha has to be changed, as statistics show that bots can crack it easily while humans find it hard to solve correctly. The discussion about an updated captcha has been going on for a while, and implementing the new version is of utmost importance. The revised version will benefit the community: it can keep bots away, make captchas less stressful for humans, and relieve OTRS volunteers of the large number of account creation requests.

Tasks

  • Design and implement the captcha system.

The captcha might require only a simple click, like the Google invisible captcha. In the background we would use JavaScript to record anonymous user behavior.

  • Decide the features to be used for the Machine Learning Model.

Relevant features include mouse speed, mouse clicks and drags, the mouse entry point, average typing speed, and the number of times the Backspace and Delete keys are used.
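As a rough illustration of how such features could be derived, the sketch below summarises one recorded session into a few numbers. The event format (dicts with `type`, `t`, `x`, `y`, `key`) and the name `extract_features` are assumptions made for this example and do not describe any existing Wikimedia code.

```python
import math


def extract_features(events):
    """Summarise one session's events into a flat feature dict.

    `events` is a hypothetical, time-ordered list of dicts such as
    {"type": "mousemove", "t": 120, "x": 10, "y": 40} or
    {"type": "keydown", "t": 900, "key": "Backspace"}.
    Timestamps `t` are in milliseconds.
    """
    moves = [e for e in events if e["type"] == "mousemove"]
    keys = [e for e in events if e["type"] == "keydown"]
    clicks = [e for e in events if e["type"] == "click"]

    # Mean mouse speed: total path length divided by elapsed time.
    distance = 0.0
    for prev, cur in zip(moves, moves[1:]):
        distance += math.hypot(cur["x"] - prev["x"], cur["y"] - prev["y"])
    move_time = (moves[-1]["t"] - moves[0]["t"]) if len(moves) > 1 else 0
    mean_speed = distance / move_time if move_time > 0 else 0.0

    # Typing speed: keystroke intervals per second between first and last key.
    key_time = (keys[-1]["t"] - keys[0]["t"]) if len(keys) > 1 else 0
    keys_per_second = (len(keys) - 1) * 1000.0 / key_time if key_time > 0 else 0.0

    # Corrections: how often Backspace or Delete was pressed.
    corrections = sum(1 for e in keys if e.get("key") in ("Backspace", "Delete"))

    return {
        "mean_mouse_speed": mean_speed,  # pixels per millisecond
        "click_count": len(clicks),
        "keys_per_second": keys_per_second,
        "correction_count": corrections,
    }
```

Only the Backspace and Delete keys are inspected here, so a capture script built along these lines would not need to store the characters a user types.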

  • Train, evaluate and compare the Machine Learning algorithms to be employed.

There are multiple approaches, such as clustering the data (e.g. with a Gaussian Mixture Model) to see which cluster a particular user's behavior belongs to, or applying classification methods such as SVM and Neural Networks.
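To make the classification option concrete, here is a minimal sketch of a linear SVM trained by plain batch subgradient descent on the regularised hinge loss, using only the standard library. The two features, the toy values and the hyperparameters are invented for illustration; a real experiment would more likely rely on an established Machine Learning library and on real captured data.

```python
def train_linear_svm(samples, labels, epochs=200, lr=0.1, reg=0.01):
    """Fit weights w and bias b for a linear SVM.

    `labels` must be +1 (human) or -1 (bot). Training uses batch
    subgradient descent on the L2-regularised hinge loss.
    """
    n = len(samples)
    n_features = len(samples[0])
    w = [0.0] * n_features
    b = 0.0
    for _ in range(epochs):
        grad_w = [reg * wi for wi in w]  # gradient of the regulariser
        grad_b = 0.0
        for x, y in zip(samples, labels):
            margin = y * (sum(wi * xi for wi, xi in zip(w, x)) + b)
            if margin < 1:  # this sample violates the margin
                for j in range(n_features):
                    grad_w[j] -= y * x[j] / n
                grad_b -= y / n
        w = [wi - lr * gi for wi, gi in zip(w, grad_w)]
        b -= lr * grad_b
    return w, b


def predict(w, b, x):
    """Return +1 (human) or -1 (bot) for feature vector x."""
    return 1 if sum(wi * xi for wi, xi in zip(w, x)) + b >= 0 else -1


# Invented toy data: [mean mouse speed, correction count].
samples = [[0.4, 3], [0.5, 2], [0.6, 4], [2.0, 0], [2.5, 0], [1.8, 0]]
labels = [1, 1, 1, -1, -1, -1]
w, b = train_linear_svm(samples, labels)
print([predict(w, b, x) for x in samples])
```

A linear model like this ignores the order of events, which is one reason sequence models are also worth comparing.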

  • Design and implement fallback captcha and integrate with first level captcha.

When the invisible captcha detects suspicious behaviour, we need a fallback captcha that serves as a harder test to keep bots away. Commonly deployed fallback captchas include distorted images of words, question captchas, MAPTCHA (Math Captcha), and selecting one or more relevant images from a set of images.
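As one small example of a fallback, a math captcha can be sketched in a few lines. The function names and the single-digit addition format below are assumptions for illustration only; they are not taken from an existing captcha extension.

```python
import random


def make_math_challenge(rng):
    """Return (question_text, expected_answer) for a simple addition."""
    a = rng.randint(1, 9)
    b = rng.randint(1, 9)
    return f"What is {a} + {b}?", a + b


def check_answer(expected, user_input):
    """Return True if the submitted text parses to the expected integer."""
    try:
        return int(user_input.strip()) == expected
    except ValueError:
        return False


rng = random.Random()
question, expected = make_math_challenge(rng)
print(question)
```

In a real deployment the expected answer would have to stay on the server (for example in the session) so that a bot cannot read it from the page.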

We can explore and experiment with other captchas if time permits. A good candidate would be a captcha that uses a Neural Network to generate cursive handwritten text, which would also make it possible to use different handwriting styles. This would be harder for current OCR systems, and hence for spambots, to crack, but easier for humans.

  • Possible Mentor(s)

Gergő Tisza, Adam Roses Wight

  • Have you contacted your mentors already?

Yes

Timeline

Ramp-up Period (23rd October - 9th November)

  • Completion of micro-tasks and fixing bugs
  • Do extensive literature review regarding different existing Captcha systems and their weaknesses

Community bonding Period (10th November - 4th December)

  • Get to know the community and the codebase
  • Explore possibilities for different invisible captcha or other hard captcha algorithms which overcome current vulnerabilities
  • Get in touch with mentor, discuss overall plan for the project and revise plan if necessary
  • Explore datasets required for the learning algorithm

Epic 1 - Implement invisible captcha data capture (5th December - 31st December)

Week 1 and 2 (5th December to 15th December)

  • Experimenting on various user behaviours to choose good features for the Machine Learning model
  • Create, obtain or transform a dataset for the Machine Learning model (available datasets might not be directly usable, since they may contain absolute mouse positions)
  • If there is no good data available for user behaviour of bots, we may have to generate this data
  • Implement simple invisible captcha and data capture
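The dataset transformation mentioned above could look roughly like the following sketch, which turns absolute mouse samples into relative movements. The `(t, x, y)` tuple layout and the name `to_relative` are assumptions for this example.

```python
def to_relative(points):
    """Convert absolute samples [(t, x, y), ...] into deltas [(dt, dx, dy), ...].

    Each output tuple describes the movement between two consecutive
    samples, so the result no longer depends on where on the screen
    the pointer was.
    """
    return [
        (t1 - t0, x1 - x0, y1 - y0)
        for (t0, x0, y0), (t1, x1, y1) in zip(points, points[1:])
    ]


trace = [(0, 100, 200), (10, 103, 204), (25, 103, 210)]
print(to_relative(trace))  # [(10, 3, 4), (15, 0, 6)]
```

The same trace shifted to another part of the screen yields identical deltas, which is the property that makes recordings from different page layouts comparable.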

Week 3 (16th December to 23rd December)

  • Testing and bug fixes
  • Documentation
  • Code review

Epic 2 - Choosing algorithm and training Machine Learning model (1st January - 31st January)

Week 4 (24th December to 31st December)

Week 5 (1st January to 8th January)

  • Classification:
    • SVM
    • BLSTM (Bidirectional LSTM; likely to be the best choice since the data is sequential)
    • GRU RNN

Week 6 and 7 (9th January to 20th January)

  • Evaluate and compare the performance of the above Machine Learning models
  • Documentation and review
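When comparing models, plain accuracy can hide the cost of challenging real users, so it helps to report several metrics side by side. The sketch below computes them from true and predicted labels; the label strings and the choice of "bot" as the positive class are assumptions for illustration.

```python
def evaluate(y_true, y_pred, positive="bot"):
    """Compute basic binary classification metrics.

    A false positive here is a human wrongly flagged as a bot, so the
    false positive rate approximates how often real users would be
    sent to the fallback captcha.
    """
    tp = fp = tn = fn = 0
    for truth, pred in zip(y_true, y_pred):
        if pred == positive:
            if truth == positive:
                tp += 1
            else:
                fp += 1
        else:
            if truth == positive:
                fn += 1
            else:
                tn += 1

    def ratio(num, den):
        # Avoid division by zero when a class is absent.
        return num / den if den else 0.0

    return {
        "accuracy": ratio(tp + tn, tp + fp + tn + fn),
        "precision": ratio(tp, tp + fp),
        "recall": ratio(tp, tp + fn),
        "false_positive_rate": ratio(fp, fp + tn),
    }


y_true = ["bot", "bot", "human", "human", "human"]
y_pred = ["bot", "human", "human", "bot", "human"]
print(evaluate(y_true, y_pred))
```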

Epic 3 - Implement captcha validation using Machine Learning model (1st February - 11th February)

Week 8 and 9 (21st January to 31st January)

  • Statistics generation for validating captcha performance
  • Integrate the Machine Learning model with the system so that data captured using EventLogging/mw.track is tested against the model and the result is returned to the frontend to determine captcha success
  • Testing and bug fixes
  • Documentation
  • Code Review

Epic 4 - Integrate fallback plugin (12th February - 28th February)

Week 9 and 10 (1st February to 11th February)

  • Integrate fallback captcha
  • Validate user entered captcha value
  • Bring up this captcha as a fallback if invisible captcha fails
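The hand-off between the two levels can be summarised as a small decision function. This is only a sketch: the score range, the 0.8 threshold and the returned strings are hypothetical and would need to be tuned and named to fit the actual system.

```python
def registration_decision(human_score, fallback_solved=None, threshold=0.8):
    """Decide what to do with a registration attempt.

    human_score     -- model output in [0, 1]; higher means more human-like
    fallback_solved -- None if the fallback captcha has not been shown yet,
                       otherwise True/False for the user's answer
    threshold       -- hypothetical cut-off for passing the invisible captcha
    """
    if human_score >= threshold:
        return "allow"          # invisible captcha passed, no challenge shown
    if fallback_solved is None:
        return "show_fallback"  # suspicious behaviour: issue explicit challenge
    return "allow" if fallback_solved else "reject"


print(registration_decision(0.95))                        # allow
print(registration_decision(0.30))                        # show_fallback
print(registration_decision(0.30, fallback_solved=True))  # allow
```

Keeping the decision in one place makes it easier to log how often the fallback is triggered, which feeds directly into the captcha performance statistics.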

Week 11, 12 and 13 (12th February to 28th February)

  • Testing / bug fixes
  • Documentation
  • Code review

Final Week 13 (1st March - 5th March)

  • Fix open bugs
  • Documentation for changes

Deliverables:

Documentation

  • Literature review on existing Captcha systems
  • Overall captcha system architecture - Invisible captcha
  • Overall captcha system architecture - Fallback captcha
  • Dataset review
  • Machine learning model performance comparison results
  • Component integration documentation
  • Basic code documentation

Code

  • Deciding on features (IPython)
  • First level invisible captcha (JavaScript and PHP)
  • Machine Learning model creation for invisible captcha (IPython)
  • Machine Learning captcha validation code (Python)
  • Fallback captcha frontend integration (JavaScript and PHP)
  • Statistics generation code (Python)

Participation

Communicating issues and ideas is crucial to the success of the project, so I will post the issues I face clearly in the relevant forums and IRC channel to get them resolved quickly. I will also report the goal for the current week, the issues faced and resolved, and the progress made on each task on my personal blog. I will push the source code to git after each task is completed.

About Me

  • Education

I have completed the course and thesis requirements for an MS by Research in Computer Science and Engineering at IIIT Hyderabad. My thesis is currently under review, and I will graduate in 2017. My research lab is CVIT, where my work is related to OCR post-processing.

  • How did you hear about this program?

A friend at college informed me about this program.

  • Will you have any other time commitments, such as school work, another job, planned vacation, etc, during the duration of the program?

No, I am completely available during the whole period of internship and have no other commitments.

  • We advise all candidates eligible for Google Summer of Code and Outreachy to apply for both programs. Are you planning to apply to both programs and, if so, with what organization(s)?

No, I plan to apply only to the Outreachy program.

  • What does making this project happen mean to you?

I have always wanted to apply what I have learnt to create something that can benefit many people. When I saw this project on the Wikimedia Outreachy page, I knew it was what I was looking for. The work also leaves a lot of room for experimenting with novel ideas that can solve the problem without complicating it too much. This level of experimentation might take longer than the internship period, but I intend to keep working on it after the internship ends. I have seen many people discussing different ideas in forums and threads. The feedback they received and their eagerness to make captchas better have given me a strong belief that together we can create an effective captcha that keeps bots away. I am not being unrealistic: I know that attackers can eventually build better bots to break any existing captcha system. I see this as a journey of continuous improvement (for me personally and for captchas), and I believe that not giving up and using a better method each time will yield rewarding results.

  • Past Experience

I have worked on academic projects with an emphasis on Machine Learning. My coursework includes Natural Language Processing, Computer Vision, Machine Learning (Pattern Recognition), Text-to-speech systems, and Information Retrieval and Extraction, all of which required applying Machine Learning to solve problems.

My thesis work was done in conjunction with the Indic OCR team in my research lab. The objective was to improve the overall accuracy of a state-of-the-art Indic OCR system by detecting and correcting OCR errors. I used Deep Learning, discriminative (Support Vector Machine) and generative (GMM-based clustering) machine learning methods to detect errors. I have a peer-reviewed oral paper describing this work, which uses, among other methods, a Bidirectional LSTM to detect errors and reports the highest average error detection accuracy to date in Indic OCR.

I also worked as a Machine Learning/Computer Vision intern at Fabulyst, a startup where I used Convolutional Neural Networks to predict attributes of dresses, such as dress type, print and pattern, from images of models wearing them. This also helped me widen my knowledge of Computer Vision and Machine Learning.
With my experience in applying machine learning to different problems, I am confident I will be able to find a working solution to the spambot detection problem.

I have not worked much with other open source projects and began contributing only a few days ago. I know I am a little late in joining the open source community, but I am sure this experience will help me contribute more, and without fear, in the future.

Tasks Completed

Bugs Fixed

  • Resolved T178099 and merged the code into master: the movedelete message is now displayed correctly in all cases where a deleted page is requested.

HomeWork:

This section details the reading and research done to understand existing captchas and how bots crack them. The following posts give much insight into the approach to take when designing an effective captcha system.

Other Research Papers:

Event Timeline

Hi, I have filled in the application to confirm my eligibility. I have not yet described the project in detail.
Please let me know if I have to complete the application before submission.

Tgr removed Tgr as the assignee of this task. (Oct 22 2017, 3:34 AM)
Tgr subscribed.

Hi @Tgr Please find the updated proposal. Kindly let me know if there is anything I should change.

Groovier updated the task description.

Your eligibility details are still missing from the outreachy.gnome.org form, please add that ASAP. See their recent email on what details are required.

Groovier updated the task description.

@Tgr Thanks for the information. I have updated the form.

Hi @Groovier,

might require a more sophisticated interaction like tracking an object which is flashed on screen

This might be unfair to most humans, e.g. old people, since it would be asking for a type of vision and cursor proficiency that isn't required for the rest of Wikipedia. Likewise, the “cursive handwritten text” suggestion is extremely education-specific, since cursive taught in schools varies widely e.g. cursive handwriting has been dropped from most U.S. schools as far as I’m aware. These are very interesting suggestions, but I would suggest dropping them from the initial phase of your proposal. We can always leave hooks for additional fallback captchas, and expand the scope of the project if you end up having lots of unused time at the end.

Weeks 12 and 13 are quite compressed, in my experience testing and fixing bugs can easily take up as much as 1/3rd of the overall project time.

Wikimedia's unique privacy restrictions are probably worth adding to your "homework" section, or otherwise incorporating into the planning process.

Impressive proposal!

Thank you @awight. I shall update the proposal with necessary changes.

Tgr renamed this task from Automatically detect spambot registration using machine learning like invisible reCAPTCHA[edit] to Automatically detect spambot registration using machine learning like invisible reCAPTCHA (Vinitha V S). (Dec 3 2017, 6:29 AM)
Tgr removed a subscriber: MediaWiki-extensions-reCaptcha.