**Name** Vinitha V S
**IRC nickname on Freenode** grooves
**Web Profile** [[ https://github.com/vinithegit | Github ]]
**Blog** [[ http://vinithavs.wordpress.com | Vinitha ]]
**Location (country or state)** Hyderabad, India
**Typical working hours** (include your timezone)
10am to 7pm (Indian Standard Time) (UTC+5:30)
(Timing is flexible and can be adjusted according to the convenience of mentor)
The problem statement is to create a captcha which is friendlier to humans and harder for bots to crack than many of the existing captcha systems. Captcha systems are essential for Wikimedia in order to prevent the registration of spambots, which can cause unnecessary load on the system or even launch attacks on it. Current captcha systems can be cracked easily by simple bots, yet they often cause undue inconvenience to humans.
The latest captcha systems which are able to meet the above requirements usually follow the strategy of [[ https://developers.google.com/recaptcha/docs/invisible | invisible captchas ]], which distinguish humans from spambots by tracking user actions instead of issuing an explicit challenge. An explicit challenge is only issued in the case where the user behavior is suspicious. Many of the off-the-shelf solutions rely on uniquely identifiable user data which can cause privacy concerns. Hence the challenge is to build an invisible captcha system which uses only anonymous data. The main objective is to identify user behavior traits which can distinguish humans and create a Machine Learning model which uses these traits as features. The major tasks here include identifying the optimal set of features and identifying the best Machine Learning algorithm to use, generating relevant statistics to analyse the effectiveness of captchas, keeping in mind the resource budget as well.
The existing Wikimedia captcha has to be changed, as statistics show that bots can easily crack it while humans find it hard to solve correctly. The [[ https://phabricator.wikimedia.org/T141490 | discussion ]] regarding an updated version of the captcha has been going on for a while, and implementing the new version is of utmost importance. Implementing the revised version will benefit the community as it can keep the bots away, make captchas less stressful for humans to use and relieve OTRS volunteers of the large number of account creation requests.
- Design and implement the captcha system.
The captcha might require only a simple click like the Google invisible captcha or might require a more sophisticated interaction like tracking an object which is flashed on screen. Using the latter approach will make it harder for bots to learn the behavior since the position of the object can be made random.
- Decide the features to be used for the Machine Learning Model.
Some relevant features include mouse speed, mouse clicks and drags, mouse entry point, average typing speed, number of uses of backspace and delete button etc.
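As a rough illustration of how such features could be derived, the sketch below summarises a captured event stream into a few numeric values. The `Event` schema, the field names and the choice of features are assumptions made for this example only; the real capture format would depend on what EventLogging/mw.track delivers.

```python
import math
from dataclasses import dataclass
from typing import List


@dataclass
class Event:
    """One captured interaction event (hypothetical schema)."""
    kind: str        # "move", "click", or "key"
    t: float         # seconds since page load
    x: float = 0.0   # pointer position, used for "move" and "click"
    y: float = 0.0
    key: str = ""    # key name, used for "key"


def extract_features(events: List[Event]) -> dict:
    """Summarise an event stream as a flat feature dictionary."""
    moves = [e for e in events if e.kind == "move"]
    keys = [e for e in events if e.kind == "key"]
    clicks = [e for e in events if e.kind == "click"]

    # Mean pointer speed: path length divided by time between first and last move.
    distance = sum(
        math.hypot(b.x - a.x, b.y - a.y) for a, b in zip(moves, moves[1:])
    )
    move_time = moves[-1].t - moves[0].t if len(moves) > 1 else 0.0
    mean_speed = distance / move_time if move_time > 0 else 0.0

    # Typing speed: key presses per second over the typing interval.
    key_time = keys[-1].t - keys[0].t if len(keys) > 1 else 0.0
    typing_speed = len(keys) / key_time if key_time > 0 else 0.0

    # Number of Backspace/Delete presses.
    corrections = sum(1 for e in keys if e.key in ("Backspace", "Delete"))

    return {
        "mean_speed": mean_speed,
        "click_count": len(clicks),
        "typing_speed": typing_speed,
        "corrections": corrections,
    }
```

Keeping the output as a flat dictionary of numbers makes it easy to try different learning algorithms on the same data later.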
- Train, evaluate and compare the Machine Learning algorithms to be employed.
There are multiple approaches: clustering the data (e.g. with a Gaussian Mixture Model) to see which cluster a particular user's behaviour belongs to, or applying classification methods such as SVMs and Neural Networks.
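To make the clustering idea concrete, the sketch below computes how strongly a feature vector belongs to each component of an already fitted Gaussian mixture. It assumes diagonal covariances and takes the component parameters as given; fitting them (for example with EM) is not shown, and the numbers in any usage are illustrative, not measured.

```python
import math
from typing import List, Sequence, Tuple

# One diagonal-covariance component: (mixture weight, means, variances).
Component = Tuple[float, Sequence[float], Sequence[float]]


def log_density(x: Sequence[float], means: Sequence[float],
                variances: Sequence[float]) -> float:
    """Log of a diagonal Gaussian density evaluated at x."""
    return sum(
        -0.5 * (math.log(2 * math.pi * v) + (xi - m) ** 2 / v)
        for xi, m, v in zip(x, means, variances)
    )


def responsibilities(x: Sequence[float],
                     components: List[Component]) -> List[float]:
    """Posterior probability of each component given the feature vector x."""
    logs = [math.log(w) + log_density(x, m, v) for w, m, v in components]
    peak = max(logs)  # subtract the maximum for numerical stability
    exps = [math.exp(value - peak) for value in logs]
    total = sum(exps)
    return [e / total for e in exps]
```

If one component has been identified as bot-like traffic, its responsibility can be read as a suspicion score for the current user.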
- Design and implement fallback captcha and integrate with first level captcha.
If the invisible captcha detects suspicious behaviour, we need a fallback captcha which serves as a harder test to keep bots away. Commonly deployed fallback captchas include distorted images of words, question captchas, MAPTCHA (Math Captcha) and selecting one or more relevant images from a set of images.
Another good candidate for this task would be a captcha which uses a [[ https://www.cs.toronto.edu/~graves/handwriting.html | Neural Network to generate cursive handwritten text ]]. This also makes it possible to use different handwriting styles. Such text would be harder for current OCR systems, and hence for spambots, to crack, but easier for humans to read.
- **Possible Mentor(s)**
Gergő Tisza, Adam Roses Wight
- **Have you contacted your mentors already?**
**Ramp-up Period (23rd October - 9th November)**
- Completion of micro-tasks and fixing bugs
- Do extensive literature review regarding different existing Captcha systems and their weaknesses
**Community bonding Period (10th November - 4th December)**
- Get to know the community and the codebase
- Explore possibilities for different invisible captcha or other hard captcha algorithms which overcome current vulnerabilities
- Get in touch with mentor, discuss overall plan for the project and revise plan if necessary
- Explore datasets required for the learning algorithm
**Epic 1 - Implement invisible captcha data capture (5th December - 31st December)**
Week 1 and 2 (5th December to 15th December)
- Experimenting on various user behaviours to choose good features for the Machine Learning model
- Create/get/transform dataset for the Machine Learning model (Available dataset might not be directly usable since it might contain absolute mouse positions)
- If there is no good data available for user behaviour of bots, we may have to generate this data
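One possible transformation for datasets that store absolute mouse positions is to convert each trace into displacements between consecutive samples. The sketch below assumes a trace is a list of `(t, x, y)` tuples, which is a made-up layout for illustration; actual datasets may differ.

```python
from typing import List, Tuple

Sample = Tuple[float, float, float]  # (t, x, y) in seconds and pixels


def to_relative(samples: List[Sample]) -> List[Tuple[float, float, float]]:
    """Return (dt, dx, dy) between each pair of consecutive samples."""
    return [
        (t1 - t0, x1 - x0, y1 - y0)
        for (t0, x0, y0), (t1, x1, y1) in zip(samples, samples[1:])
    ]
```

The result is unchanged when the whole trace is shifted on the screen, so the model sees the shape and timing of the movement rather than where on the page it happened.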
Week 3 (16th December to 23rd December)
- Implement simple invisible captcha and data capture
- Implement novel invisible captcha (tracking an object flashed on the screen) to get mouse movement data which is hard to fake for a bot
Week 4 (24th December to 31st December)
- Testing and bug fixes
- Code review
**Epic 2 - Choosing algorithm and training Machine Learning model (1st January - 31st January)**
Week 5 (1st January to 8th January)
- [[ https://ocs.aaai.org/Papers/KDD/1996/KDD96-037.pdf | DBSCAN ]]
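For reference, a compact pure-Python sketch of DBSCAN is given below. It uses a brute-force neighbour search, so it is only suitable for small experiments; the parameter names `eps` and `min_pts` follow the usual description of the algorithm, and a library implementation would likely be used in practice.

```python
import math
from typing import List, Optional, Sequence

NOISE = -1  # label given to points that belong to no cluster


def region_query(points: List[Sequence[float]], i: int, eps: float) -> List[int]:
    """Indices of all points within eps of points[i], including i itself."""
    return [j for j, q in enumerate(points) if math.dist(points[i], q) <= eps]


def dbscan(points: List[Sequence[float]], eps: float, min_pts: int) -> List[int]:
    """Return one cluster label per point (0, 1, ...) or NOISE."""
    labels: List[Optional[int]] = [None] * len(points)
    cluster = -1
    for i in range(len(points)):
        if labels[i] is not None:
            continue  # already assigned to a cluster or marked as noise
        neighbours = region_query(points, i, eps)
        if len(neighbours) < min_pts:
            labels[i] = NOISE  # may later be relabelled as a border point
            continue
        cluster += 1
        labels[i] = cluster
        queue = [j for j in neighbours if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == NOISE:
                labels[j] = cluster  # border point: joins but does not expand
            if labels[j] is not None:
                continue
            labels[j] = cluster
            j_neighbours = region_query(points, j, eps)
            if len(j_neighbours) >= min_pts:
                queue.extend(j_neighbours)  # core point: keep expanding
    return [NOISE if label is None else label for label in labels]
```

Points labelled `NOISE` fall in low-density regions; whether such outliers correspond to bots is something the experiments would have to establish.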
Week 6 and 7 (9th January to 20th January)
- BLSTM (Bidirectional [[ https://en.wikipedia.org/wiki/Long_short-term_memory | LSTM ]]). This is likely to be the best choice since the data is sequential.
- [[ https://en.wikipedia.org/wiki/Gated_recurrent_unit | GRU RNN ]]
Week 8 and 9 (21st January to 31st January)
- Evaluate and compare the performance of the above Machine Learning models
- Documentation and review
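A small sketch of how the candidate models could be compared on a labelled test set is shown below. The label convention (1 = bot, 0 = human) and the choice of metrics are assumptions for illustration; the false positive rate is included because it corresponds to humans being challenged unnecessarily.

```python
from typing import Dict, Sequence


def evaluate(y_true: Sequence[int], y_pred: Sequence[int]) -> Dict[str, float]:
    """Binary classification metrics, with 1 = bot and 0 = human (assumed)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # bots caught
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # humans flagged
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # bots missed
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # humans passed

    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    false_positive_rate = fp / (fp + tn) if fp + tn else 0.0

    return {
        "precision": precision,
        "recall": recall,
        "f1": f1,
        "false_positive_rate": false_positive_rate,
    }
```

Running the same function over the predictions of each model gives directly comparable numbers.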
**Epic 3 - Implement captcha validation using Machine Learning model (1st February - 11th February)**
Week 9 and 10 (1st February to 11th February)
- Integrate Machine Learning model with the system so that data captured using EventLogging/mw.track is tested against the model and the result is returned back to the frontend to determine captcha success
- Testing and bug fixes
- Code Review
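The validation step could look roughly like the sketch below: the extracted features are scored by the trained model and a small JSON result is returned for the frontend. The function name, the response fields and the 0.5 threshold are all hypothetical; `score_fn` stands in for whichever model is finally chosen.

```python
import json
from typing import Callable, Dict


def validate(features: Dict[str, float],
             score_fn: Callable[[Dict[str, float]], float],
             threshold: float = 0.5) -> str:
    """Score one interaction and return a JSON verdict for the frontend.

    score_fn is assumed to return the estimated probability that the
    interaction came from a bot. The threshold is an illustrative default.
    """
    bot_probability = score_fn(features)
    passed = bot_probability < threshold
    # When the invisible captcha is not passed, the frontend should show
    # the fallback captcha instead of rejecting the user outright.
    return json.dumps({"success": passed, "fallback": not passed})
```

Passing the model in as a function keeps the validation code independent of the learning library used to train it.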
**Epic 4 - Implement fallback plugin (12th February - 28th February)**
Week 11 (12th February to 19th February)
- Integrate handwriting generation code with backend
- Send generated image to frontend
- Validate user entered captcha value
- Bring up this captcha as a fallback if invisible captcha fails
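Validating the value entered by the user can be sketched as follows. The normalisation rules (ignoring case and surrounding whitespace) are an assumption about what would be convenient for humans; `hmac.compare_digest` from the Python standard library is used so that the comparison time does not depend on where the strings differ.

```python
import hmac
import unicodedata


def normalise(text: str) -> str:
    """Normalise Unicode form, trim whitespace and ignore letter case."""
    return unicodedata.normalize("NFKC", text).strip().casefold()


def answer_matches(expected: str, submitted: str) -> bool:
    """Compare the stored captcha answer with the user's input."""
    return hmac.compare_digest(
        normalise(expected).encode("utf-8"),
        normalise(submitted).encode("utf-8"),
    )
```

For example, `answer_matches("Wikipedia", "  wikipedia ")` accepts the answer, while a single wrong letter is rejected.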
Week 12 and 13 (20th February to 28th February)
- Statistics generation
- Testing / bug fixes
- Code review
Final Week 13 (1st March - 5th March)
- Fix open bugs
- Documentation for changes
- Literature review on existing Captcha systems
- Overall captcha system architecture - Invisible captcha
- Overall captcha system architecture - Fallback captcha
- Dataset review
- Machine learning model performance comparison results
- Component integration documentation
- Basic code documentation
- Deciding on features (iPython)
- Machine Learning model creation for basic captcha (iPython)
- Machine Learning captcha validation code (Python)
- Machine Learning model creation for object tracking captcha (iPython)
- Machine Learning captcha validation code for object tracking captcha (Python)
- Fallback captcha handwriting generation code (Python)
Communicating issues and ideas is crucial to the success of the project, so I will post the issues I face clearly in the relevant forums and IRC channel for speedy resolution. I will also update my personal blog with the goal for the current week, the issues faced and resolved, and the progress made on each task. I shall update the source code in git after each task is completed.
# **About Me**
I have completed the requirements for an MS by Research in Computer Science and Engineering at [[ https://www.iiit.ac.in/ | IIIT Hyderabad ]]. I will graduate in 2017. My research lab is [[ http://cvit.iiit.ac.in | CVIT ]], where my work is related to [[ http://ocr.iiit.ac.in | OCR post-processing ]]. I have completed the course and thesis requirements, and have submitted my thesis, which is currently under review.
- **How did you hear about this program?**
A friend at college informed me about this program.
- **Will you have any other time commitments, such as school work, another job, planned vacation, etc, during the duration of the program?**
No, I am completely available during the whole period of internship and have no other commitments.
- **We advise all candidates eligible for Google Summer of Code and Outreachy to apply for both programs. Are you planning to apply to both programs and, if so, with what organization(s)?**
No, I plan to apply only for the Outreachy program.
- **What does making this project happen mean to you?**
I have always wanted to apply what I have learnt to create something which can benefit many people. When I saw this project mentioned on the Wikimedia Outreachy page, I knew this was what I was looking for. This work also offers a lot of scope for experimenting with novel ideas which can solve the problem without complicating it too much. This level of experimentation might take time beyond the internship period, but I intend to keep working on it after the internship ends. I have seen many people discussing many different ideas in [[ https://phabricator.wikimedia.org/T141490 | forums/threads ]]. The feedback they received and the eagerness they show to make captchas better have given me a strong belief that together we can create an effective captcha which can keep bots away. I am not being unrealistic: I know that attackers can eventually create better bots to break any existing captcha system. I feel this is a journey of continuous improvement (for me personally and for captchas), and that not giving up and using a better method each time will yield rewarding results.
- **Past Experience**
I have worked on academic projects with emphasis on Machine Learning. My course works include Natural Language Processing, Computer Vision, Machine Learning (Pattern Recognition), Text-to-speech systems, Information Retrieval and Extraction, all of which required application of Machine learning for problem solving.
My thesis work was done in conjunction with the Indic OCR team in my research lab. The objective of my work was to improve the overall accuracy of a state-of-the-art Indic OCR system by detecting OCR errors and correcting them. I used Deep Learning, discriminative (Support Vector Machine) and generative (GMM-based clustering) machine learning methods to detect errors. I have a peer-reviewed oral [[ http://ocr.iiit.ac.in/data/publications/Error_Detection_in_Indic_OCRs.pdf | paper ]] describing my work, which uses, among other methods, a Bidirectional LSTM to detect errors and reported the highest average error detection accuracy to date in Indic OCRs.
I also worked as a Machine Learning/Computer Vision intern at [[ http://fabulyst.com/ | Fabulyst ]], a startup where I used Convolutional Neural Networks to predict attributes of dresses, such as dress type, print and pattern, from images of models wearing them. This also helped me widen my knowledge of Computer Vision and Machine Learning.
With my experience in applying machine learning to different problem statements, I am confident I will be able to find a working solution to the spambot detection problem.
I have not worked much with other open source projects and began contributing only a few days ago. I know I am a little late in joining the open source community, but I am sure this experience will help me contribute more, and fearlessly, in the future.
My tasks are:
- Completed [[ https://phabricator.wikimedia.org/T175330 | Task 1 ]]; my code changes can be viewed [[ https://gerrit.wikimedia.org/r/#/c/384753/ | here ]].
- [[ https://gerrit.wikimedia.org/r/#/c/385845/ | Task 2 ]], submitted for review.
I am currently working on the other tasks, which will be completed soon. I will also be working on fixing bugs in MediaWiki.
This section details the reading and research done to understand existing captchas and how bots crack them. The following posts give considerable insight into the approach that should be taken when designing an effective captcha system.
- Wikipedia sources about [[ https://en.wikipedia.org/wiki/CAPTCHA | captcha ]] and [[ https://en.wikipedia.org/wiki/ReCAPTCHA | recaptcha ]]
- Interesting details about defeated captchas: [[ http://caca.zoy.org/wiki/PWNtcha | PWNtcha ]]
- Captchas using flash/ animated captchas : [[ http://www.h-online.com/security/news/item/NuCaptcha-Flash-CAPTCHAs-to-combat-spambots-1032147.html | NuCaptcha ]]
- Detailed discussion about [[ https://stackoverflow.com/questions/25545514/how-does-this-checkbox-recaptcha-work-and-how-can-i-use-it/25626267#25626267 | Google’s invisible captcha ]]
- More about [[ https://tehnoblog.org/google-no-captcha-recaptcha-first-experience-results-review/ | no captcha ]]
- A step-wise detailed [[ http://www.brains-n-brawn.com/default.aspx?vDir=aicaptcha | blog ]] illustrating how captcha was cracked employing computer vision and logic.
**Other Research Papers:**
- [[ https://dl.acm.org/citation.cfm?id=2046724 | Text-based Captcha Strengths and weaknesses ]]
- [[ http://www.blackhat.com/docs/asia-16/materials/asia-16-Sivakorn-Im-Not-a-Human-Breaking-the-Google-reCAPTCHA-wp.pdf | I’m not a human: Breaking the Google Recaptcha ]]
- [[ https://www.usenix.org/system/files/conference/woot17/woot17-paper-bock.pdf | unCaptcha: A Low-Resource Defeat of reCaptcha’s Audio Challenge ]]
- [[ https://link.springer.com/chapter/10.1007%2F978-3-540-30144-8_23 | Image Recognition CAPTCHAs ]]
- [[ http://ieeexplore.ieee.org/abstract/document/5958019/ | The Failure of Noise-Based Non-Continuous Audio Captchas ]]