Name: Saurabh Batra
IRC nick: saurabhbatra
Web Page: http://saurabhbatra96.github.io/
Typical working hours: 12 PM - 10 PM UTC+5:30
The project aims to build a new open-source fraud detection system. The 2 major steps involved are:
- experimenting with various anomaly detection techniques (see the ML section at the end) to figure out which one provides the required balance of precision (% of detected frauds which are actually fraudulent) and recall (% of all frauds detected);
- providing the technique as an independent web service to WMF (like ORES) which can handle requests to assess whether a transaction is legitimate.
- The web service uses the feedback from its decisions (a new correct detection, or a wrong detection corrected by a human) to retrain the underlying model, improving its accuracy in the future.
- Use an explanation tool such as LIME to justify why the classifier marked a transaction as fraudulent.
- A CiviCRM extension that interfaces directly with the web service.
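To make the precision and recall definitions above concrete, here is a minimal sketch in Python. The function name and the toy labels are made up for illustration; they are not taken from WMF data or from any existing code.

```python
def precision_recall(actual, predicted):
    """Return (precision, recall) for binary labels where True means fraud."""
    pairs = list(zip(actual, predicted))
    tp = sum(1 for a, p in pairs if a and p)        # frauds correctly flagged
    fp = sum(1 for a, p in pairs if not a and p)    # normal transactions flagged
    fn = sum(1 for a, p in pairs if a and not p)    # frauds that were missed
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall


# Toy example (hypothetical labels): 4 real frauds, 5 flagged, 3 flagged correctly.
actual = [True, True, True, True, False, False, False, False, False, False]
predicted = [True, True, True, False, True, True, False, False, False, False]

precision, recall = precision_recall(actual, predicted)
print(f"precision={precision:.2f} recall={recall:.2f}")
```

In this toy case 3 of the 5 flagged transactions are real frauds (precision 0.60) and 3 of the 4 real frauds are caught (recall 0.75). Which trade-off is acceptable would have to be decided with the fundraising team.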
Possible Mentor(s): @Eileenmcnaughton, @awight
Have you contacted your mentors already? I worked with Eileen for about a year in 2016, including a GSoC project for CiviCRM, and I have discussed the proposal with Adam.
I’m going to divide the work into 2 major phases:
Experimentation phase (May - mid June)
The experimentation phase will mainly consist of trying out the proposed techniques on the current dataset and comparing how they perform against each other and against the current fraud detection system. Tentative tasks include:
- (Pre-Week 1 - Week 1) Dataset procurement and cleaning
- (Week 1-2) Reading up on and applying feature selection to the dataset
- (Week 2-5) Reading up on and applying anomaly detection techniques; comparing precision and recall scores; deciding on the best technique for the web service
- Checkpoint 1: Experimentation phase finished; we should know which technique works best. Deliverables: An understanding of how the fraud detection system should be modeled, along with a proof of concept.
Architectural phase (June - August)
The architectural phase involves integrating the best-performing technique with a web service. Tentative tasks include:
- (Week 6) API design for the web service
- (Week 6-7) Setting up the bare-bones architecture for the web service
- Checkpoint 2: Midway through the architectural phase; we have the model and the web service as separate pieces. Deliverables: Proof of concept and a well-thought-out design document for the web service.
- (Week 7-8) Implement the API (or at least the important parts of it)
- Checkpoint 3: Almost done with the architectural phase; we need to decide how to fit the web service into the WMF transaction workflow. Deliverables: The completed fraud detection web service.
- (Week 9-10) Integrate the API into the WMF transaction workflow
- Communication: I usually try to be available on IRC during work hours and by email for the rest of my waking hours.
- Source code: The source code will live in an independent repository, although we might want to merge it with the ORES code later on.
- Progress reports: Weekly progress reports to the fundraising mail thread or on Wikimedia-Fundraising.
I'm currently a final-year B.Tech. student in Computer Science & Engineering at IIT Guwahati, India. I started contributing to CiviCRM in 2015 and ended up doing a GSoC project with Eileen in 2016; she was the one who introduced me to the folks here at WMF. This project will be my top priority during my summer break, as I have no other pressing commitments during that time.
For the past year I've been working on a thesis project on data science and information retrieval which involves machine learning techniques similar to the ones I want to use here. In addition to that I have considerable experience working with open source organizations - I was an active contributor to CiviCRM and a GSoC participant back in 2016.
Also, I'm comfortable adapting to new tech stacks and getting "code-ready" in a short period of time thanks to my internship at Google in 2017.
Machine Learning Techniques for Anomaly Detection
- Autoencoders: Autoencoders are neural nets that try to learn the underlying patterns in data in an unsupervised way. Outliers to these patterns are detected as anomalies. More details.
- Logistic Regression: Logistic regression tries to find the best-fitting (yet reasonable) model to describe the relationship between a dependent variable (fraud/not fraud) and a set of independent variables (features). Transactions whose predicted probability of fraud exceeds a chosen threshold are flagged.
- Supervised Learning using Classifiers: The problem with using supervised learning is that if, for example, an SVM guessed that transactions were never fraudulent, it would’ve been correct ~99.6% of the time on WMF’s transactions from 2017. A workaround is to under-sample normal transactions so that frauds are not overwhelmingly outnumbered by normal transactions in the training data. An ensemble of classifiers (something which combines the outputs of multiple classifiers and then classifies the transaction as fraud/not fraud) should work even better than a single classifier.
- An interesting one (just read the dataset description and conclusions if you don’t want to go through the entirety of it): http://www.wipro.com/documents/comparative-analysis-of-machine-learning-techniques-for-detecting-insurance-claims-fraud.pdf
- Radar is proprietary software that does exactly what we’re trying to achieve: https://stripe.com/radar
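The class-imbalance problem and the under-sampling workaround described under "Supervised Learning using Classifiers" can be sketched as follows. This is only an illustration under stated assumptions: the data is synthetic (4 frauds in 1000 transactions, chosen to mirror the ~99.6% figure), and the `undersample` helper and its `ratio` parameter are hypothetical names, not part of any existing WMF or ORES code.

```python
import random


def undersample(transactions, ratio=1.0, seed=0):
    """Keep every fraud and a random subset of the normal transactions.

    Each transaction is a (features, is_fraud) pair. `ratio` is the desired
    number of normal transactions per fraud (hypothetical parameter).
    """
    frauds = [t for t in transactions if t[1]]
    normals = [t for t in transactions if not t[1]]
    keep = min(len(normals), int(len(frauds) * ratio))
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    balanced = frauds + rng.sample(normals, keep)
    rng.shuffle(balanced)
    return balanced


# Synthetic data (assumption): 4 frauds among 1000 transactions, i.e. 0.4%.
transactions = [({"id": i}, i < 4) for i in range(1000)]

# A "classifier" that always answers "not fraud" still looks very accurate.
always_normal_accuracy = (
    sum(1 for _, is_fraud in transactions if not is_fraud) / len(transactions)
)
print(always_normal_accuracy)

# After under-sampling, frauds make up half of the training set.
balanced = undersample(transactions, ratio=1.0)
fraud_share = sum(1 for _, is_fraud in balanced if is_fraud) / len(balanced)
print(len(balanced), fraud_share)
```

Under-sampling would only be applied to the data used for training; precision and recall should still be measured on data with the original fraud rate, otherwise the scores would look better than they would be in production.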