Maniphest T190523

Machine Learning for Fraud Detection
Closed, Resolved (Public)

Authored By

	saurabhbatra96
	Mar 23 2018, 2:12 PM

Description

Profile Information

Name: Saurabh Batra
IRC nick: saurabhbatra
Web Page: http://saurabhbatra96.github.io/
Resume: http://saurabhbatra96.github.io/public/cv.pdf
Location: India
Typical working hours: 12 PM - 10 PM UTC+5:30

Synopsis

The project aims to build a new open-source fraud detection system. The 2 major steps involved are:

experimenting with various anomaly detection techniques (see the ML section at the end) to figure out which one provides the required balance of precision (% of detected frauds which are actually fraudulent) and recall (% of all frauds detected);
providing the technique as an independent web service to WMF (like ORES) which can serve requests to assess the authenticity of transactions.
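
The two metrics defined above can be computed directly from true and predicted labels. The following is a minimal sketch, assuming binary labels where 1 means fraud; the function name and the toy labels are hypothetical and not taken from WMF data.

```python
def precision_recall(y_true, y_pred):
    """Return (precision, recall) for binary labels where 1 = fraud."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # frauds correctly flagged
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # normal transactions flagged
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # frauds that were missed
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall


# Hypothetical toy labels, for illustration only.
y_true = [1, 0, 0, 1, 1, 0, 0, 0]
y_pred = [1, 1, 0, 1, 0, 0, 0, 0]
precision, recall = precision_recall(y_true, y_pred)
print(f"precision={precision:.2f} recall={recall:.2f}")
```

Raising the decision threshold of a model typically trades recall for precision, which is why the proposal compares techniques on both scores rather than on accuracy alone.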

Stretch Goals

The web service uses the feedback from its decisions (new correct detection/wrong detection corrected by a human) to train the underlying model, improving its accuracy in the future.
Use something like LIME to provide a justification as to why our classifier chose to mark a transaction as fraud.
CiviCRM extension to interface directly with the web service.

Possible Mentor(s) @Eileenmcnaughton , @awight
Have you contacted your mentors already? I worked with Eileen for about a year back in 2016, which included a GSoC project for CiviCRM, and I have discussed the proposal with Adam.

Timeline

I’m going to divide the work into 2 major phases:

Experimentation phase (May - mid June)
The experimentation phase will mainly consist of trying out the proposed techniques on the current dataset and comparing how they perform against each other and against the current fraud detection system. Tentative tasks include:

(Pre-Week 1 - Week 1) Dataset procurement and cleaning
(Week 1-2) Reading up and applying feature selection to the dataset
(Week 2-5) Reading up and applying anomaly detection techniques; comparing precision and recall scores; deciding on the best technique for the web service
Checkpoint 1 Experimentation phase finished; we should know which technique works best. Deliverables: Theoretical knowledge of how the fraud detection system should be modeled, along with a proof of concept.

Architectural phase (June - August)
The architectural phase involves integrating the best-performing technique with a web service. Tentative tasks include:

(Week 6) API design for the web service
(Week 6-7) Setting up the bare-bones architecture for the web service
Checkpoint 2 Mid-way through the architectural phase; we have the model and we have the web service separately. Deliverables: Proof of concept and a well thought out design document for the web service.
(Week 7-8) Implement the API (or at least the important parts of it)
Checkpoint 3 Almost done with the architectural phase; we need to decide how to fit the web service into the WMF transaction workflow. Deliverables: The completed fraud detection web service.
(Week 9-10) Integrate the API into WMF transaction workflow
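
As a rough illustration of what the web service's scoring endpoint might look like, here is a framework-agnostic sketch using only the standard library. Everything in it is an assumption made for illustration: the `features` and `fraud_score` field names, the `handle_score_request` function, and the placeholder `dummy_model` are hypothetical and do not describe the actual API.

```python
import json


def dummy_model(features):
    """Stand-in for the trained model; returns a fraud score in [0, 1]."""
    # Placeholder rule for illustration only, not a real fraud signal.
    return 0.9 if features.get("amount", 0) > 1000 else 0.1


def handle_score_request(body, model=dummy_model):
    """Take a JSON request body and return (HTTP status, JSON response body)."""
    try:
        payload = json.loads(body)
    except json.JSONDecodeError:
        return 400, json.dumps({"error": "invalid JSON"})
    if not isinstance(payload, dict) or not isinstance(payload.get("features"), dict):
        return 400, json.dumps({"error": "expected an object with a 'features' object"})
    score = model(payload["features"])
    return 200, json.dumps({"fraud_score": score})


status, response = handle_score_request('{"features": {"amount": 5000}}')
print(status, response)
```

Keeping the model behind a single callable means the endpoint does not change when the best-performing technique from the experimentation phase is swapped in.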

Participation

Communication: I usually try to be available on IRC during work hours and by email the rest of my non-sleep time.
Source code: The source code belongs in an independent repository, although we might want to combine it with the ORES code later on.
Progress reports: Weekly progress reports to the fundraising mail thread or on Wikimedia-Fundraising.

About Me

I'm currently a final-year B.Tech. student in Computer Science & Engineering at IIT Guwahati, India. I started contributing to CiviCRM in 2015 and ended up doing a GSoC project with Eileen in 2016; she was the one who introduced me to the folks here at WMF. This project is going to be priority number one during my summer break, as I don't have any other pressing commitments during that time.

Past Experience

For the past year I've been working on a thesis project on data science and information retrieval which involves machine learning techniques similar to the ones I want to use here. In addition to that I have considerable experience working with open source organizations - I was an active contributor to CiviCRM and a GSoC participant back in 2016.

Also, I'm comfortable adapting to new tech stacks and getting "code-ready" in a short period of time thanks to my internship at Google in 2017.

Other Info

Machine Learning Techniques for Anomaly Detection

Autoencoders: Autoencoders are neural nets that try to learn the underlying patterns in data in an unsupervised way. Outliers to these patterns are detected as anomalies.
Logistic Regression: Logistic regression tries to find the best-fitting (yet reasonable) model to describe the relationship between a dependent variable (fraud/not fraud) and a set of independent variables (features). Transactions to which the model assigns a high probability of fraud are flagged.
Supervised Learning using Classifiers: The problem with supervised learning here is class imbalance: if, for example, an SVM guessed that transactions were never fraudulent, it would have been correct ~99.6% of the time on WMF’s transactions from 2017. A workaround is to under-sample normal transactions so that frauds are not overwhelmingly outnumbered by normal transactions. An ensemble of classifiers (something that combines the outputs of multiple classifiers and then classifies the transaction as fraud/not fraud) should work even better than a single classifier.
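
The class-imbalance problem and the two workarounds mentioned above (under-sampling and ensembling) can be sketched with the standard library alone. The data below is synthetic, chosen only to reproduce a 0.4% fraud rate; `undersample`, `majority_vote`, and the `is_fraud` field are hypothetical names, and majority voting is just one simple way of combining classifiers.

```python
import random


def undersample(transactions, ratio=1.0, seed=0):
    """Keep every fraud and a random subset of normal transactions.

    ratio is the number of normal transactions kept per fraud.
    """
    frauds = [t for t in transactions if t["is_fraud"]]
    normals = [t for t in transactions if not t["is_fraud"]]
    rng = random.Random(seed)
    keep = min(len(normals), int(len(frauds) * ratio))
    sample = frauds + rng.sample(normals, keep)
    rng.shuffle(sample)
    return sample


def majority_vote(votes):
    """Combine per-classifier fraud votes; a tie counts as not fraud."""
    return sum(1 for v in votes if v) * 2 > len(votes)


# Synthetic data: 4 frauds among 1000 transactions (0.4%).
transactions = [{"id": i, "is_fraud": i < 4} for i in range(1000)]

# A classifier that always answers "not fraud" looks accurate but catches nothing.
baseline_accuracy = sum(1 for t in transactions if not t["is_fraud"]) / len(transactions)
print(f"always-not-fraud accuracy: {baseline_accuracy:.3f}, recall: 0.0")

balanced = undersample(transactions, ratio=1.0)
print(f"balanced training set size: {len(balanced)}")
print("ensemble says fraud:", majority_vote([True, True, False]))
```

Under-sampling should only be applied to the training split; precision and recall still need to be measured on data with the original class balance.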

Additional Links

https://blog.codecentric.de/en/2017/09/data-science-fraud-detection/
http://ieeexplore.ieee.org/document/8123782/?reload=true
An interesting one (just read the dataset description and conclusions if you don’t want to go through the entirety of it): http://www.wipro.com/documents/comparative-analysis-of-machine-learning-techniques-for-detecting-insurance-claims-fraud.pdf
Radar is a proprietary software that does exactly what we’re trying to achieve: https://stripe.com/radar
https://iwringer.wordpress.com/2015/11/17/anomaly-detection-concepts-and-techniques/

Related Objects

Mentioned In: T190103: GSOC proposal - Machine learning fraud detection

Event Timeline

saurabhbatra96 created this task. Mar 23 2018, 2:12 PM

saurabhbatra96 mentioned this in T190103: GSOC proposal - Machine learning fraud detection. Mar 23 2018, 2:14 PM

srishakatux moved this task from Backlog to Proposals In Progress on the Google-Summer-of-Code (2018) board. Mar 23 2018, 7:32 PM

@saurabhbatra96 The proposal and timeline look very good, I only have two suggestions:

(Week 1) Dataset procurement and cleaning

I imagine data collection will take longer, unless there's something ready to go already? Even after the initial data is collected, we'll probably come back and iterate later.

Can you identify one or two "short" goals which might give added value even if the full project can't be completed in our timeframe?

I imagine data collection will take longer, unless there's something ready to go already? Even after the initial data is collected, we'll probably come back and iterate later.

So GSoC has a community bonding period from April 23 to May 14; I was counting week 1 as the first week of coding, i.e. the week starting on May 15, so we should have about 3-4 weeks to collect and sanitize the data. Still, I should probably mention this in the proposal, since it's not very clear from the timeline.

Can you identify one or two "short" goals which might give added value even if the full project can't be completed in our timeframe?

On it.

saurabhbatra96 updated the task description. Mar 25 2018, 11:56 AM

@awight I've added some checkpoints to help us evaluate exactly what we should be aiming to achieve in order to deem the project "complete", if not a "success", if we come up against hard-to-cross hurdles.

DStrine moved this task from Triage to Blocked or not fr-tech on the Fundraising-Backlog board. Mar 26 2018, 7:36 PM

@saurabhbatra96 I was asked on the CiviCRM channel if machine learning could also be used to predict donor behaviour (e.g. likelihood of donating again or increasing amount). I'm not expecting this to be in scope, but I'm mentioning it to see if the possibility of meeting this 'smaller size org' feature request affects the design.

@Eileenmcnaughton I think that is quite possible to do within the current design, because all we have is a machine learning model exposed externally as an API. It makes no difference to the architecture whether the model predicts donor behavior or frauds. The extra work, of course, is in implementing a model to predict donor behavior, which is currently out of scope but which anyone familiar with machine learning can implement.

Jatin0312 moved this task from Proposals In Progress to Proposals Submitted on the Google-Summer-of-Code (2018) board. Apr 2 2018, 1:09 PM

Jatin0312 moved this task from Proposals Submitted to Proposals In Progress on the Google-Summer-of-Code (2018) board. Apr 2 2018, 1:22 PM

Meeting 1 discussion points -

Find out the original dataset for https://www.kaggle.com/mlg-ulb/creditcardfraud
Postpone feature selection to the integration phase.
Ideas about semi-supervised models that can predict fraud/not fraud/maybe fraud. Pass the maybe fraud values to MinFraud.
Idea about having separate models filtered on the basis of location (IP address filtering) with varying degrees of threshold scores.
This week's work - experimentation with classical classifiers (SVM, logistic regression, etc.)
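
The three-way fraud/not fraud/maybe fraud idea with location-dependent thresholds can be sketched as below. This assumes the model outputs a score in [0, 1]; the threshold values, region keys, and function name are hypothetical, and the hand-off to MinFraud is only indicated by the returned label.

```python
# Hypothetical (low, high) score thresholds per region; values are made up.
THRESHOLDS_BY_REGION = {
    "default": (0.2, 0.8),
    "high_risk": (0.1, 0.6),
}


def triage(fraud_score, region="default"):
    """Map a model score in [0, 1] to 'fraud', 'not fraud' or 'maybe fraud'."""
    if not 0.0 <= fraud_score <= 1.0:
        raise ValueError("fraud_score must be between 0 and 1")
    low, high = THRESHOLDS_BY_REGION.get(region, THRESHOLDS_BY_REGION["default"])
    if fraud_score >= high:
        return "fraud"
    if fraud_score <= low:
        return "not fraud"
    # Uncertain band: these are the cases that would be passed on to MinFraud.
    return "maybe fraud"


for score in (0.05, 0.5, 0.9):
    print(score, triage(score), triage(score, region="high_risk"))
```

Widening the band between the two thresholds sends more transactions to the second check, so the thresholds control how much is decided by the model alone.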

srishakatux closed this task as Declined. Jun 5 2018, 6:32 PM

Hi @srishakatux, reopening this because the project is still underway and we're tracking it here; removed the GSoC tag though.

awight added a project: Machine-Learning-Team. Jun 13 2018, 9:35 AM

awight moved this task from Unsorted to Backlog/Lift Wing on the Machine-Learning-Team board.

Updates -

ML code snippets are being tracked here - https://github.com/saurabhbatra96/wmf-samplecodes
PR plots for various classifiers with the dummy dataset - https://github.com/saurabhbatra96/wmf-samplecodes#classifier-comparisions
Completed the code for classifier parameter optimisation.
Experiments with LIME to explain classifier results - https://github.com/saurabhbatra96/wmf-samplecodes/blob/master/classifier-explainer.ipynb

WIP -

API Design + Flask code
WMF data access

Pending -

Feature selection and normalisation for WMF data
ML model + API integration
API + WMF transaction flow integration

Tracking API frontend code here - https://github.com/saurabhbatra96/wmf-fd-api

Yay!

The API seems to be functioning as required - https://github.com/saurabhbatra96/wmf-fd-api.
The model is making accurate predictions on new data (August fraud).
FR-tech is aware of all progress made on this.

Since we've successfully reached checkpoint 3 I'm closing this. I propose we start a new issue for integration with Donor Services.
