
Machine Learning for Fraud Detection
Closed, Resolved (Public)

Description

Profile Information

Name: Saurabh Batra
IRC nick: saurabhbatra
Web Page: http://saurabhbatra96.github.io/
Resume: http://saurabhbatra96.github.io/public/cv.pdf
Location: India
Typical working hours: 12 PM - 10 PM UTC+5:30

Synopsis

The project aims to build a new open-source fraud detection system. The 2 major steps involved are:

  • experimenting with various anomaly detection techniques (see the ML section at the end) to figure out which one provides the required balance of precision (% of detected frauds which are actually fraudulent) and recall (% of all frauds detected);
  • providing the technique as an independent web service to WMF (like ORES) that accepts requests to assess whether transactions are fraudulent.
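
To make the two metrics concrete, here is a minimal sketch of how precision and recall could be computed when comparing techniques. The function name and the toy labels are made up for illustration; they are not taken from WMF data.

```python
def precision_recall(y_true, y_pred):
    """Return (precision, recall) for binary labels where 1 means fraud."""
    pairs = list(zip(y_true, y_pred))
    true_pos = sum(1 for t, p in pairs if t == 1 and p == 1)
    false_pos = sum(1 for t, p in pairs if t == 0 and p == 1)
    false_neg = sum(1 for t, p in pairs if t == 1 and p == 0)
    # Guard against division by zero when nothing is flagged or no fraud exists.
    precision = true_pos / (true_pos + false_pos) if (true_pos + false_pos) else 0.0
    recall = true_pos / (true_pos + false_neg) if (true_pos + false_neg) else 0.0
    return precision, recall


# Toy labels (invented for illustration): 1 = fraud, 0 = legitimate.
y_true = [1, 0, 1, 1, 0, 0, 0, 1]
y_pred = [1, 0, 0, 1, 1, 0, 0, 1]
precision, recall = precision_recall(y_true, y_pred)
print(precision, recall)
```

In this toy case three of the four flagged transactions are real frauds, and three of the four real frauds are flagged, so both values come out at 0.75.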

Stretch Goals

  • The web service uses the feedback from its decisions (new correct detection/wrong detection corrected by a human) to train the underlying model, improving its accuracy in the future.
  • Use something like LIME to provide a justification as to why our classifier chose to mark a transaction as fraud.
  • CiviCRM extension to interface directly with the web service.
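
As a rough illustration of the feedback stretch goal, the sketch below updates a tiny logistic-regression model one reviewed transaction at a time. The class name, the two-feature vector, and the learning rate are all assumptions made for the example; the real model and features would come out of the experimentation phase.

```python
import math


class OnlineFraudModel:
    """Tiny logistic-regression model updated one labelled example at a time."""

    def __init__(self, n_features, learning_rate=0.1):
        self.weights = [0.0] * n_features
        self.bias = 0.0
        self.learning_rate = learning_rate

    def predict_proba(self, features):
        """Return the estimated probability that the transaction is fraud."""
        z = self.bias + sum(w * x for w, x in zip(self.weights, features))
        return 1.0 / (1.0 + math.exp(-z))

    def record_feedback(self, features, is_fraud):
        """Take one gradient step on the log loss using a human-verified label."""
        error = self.predict_proba(features) - (1.0 if is_fraud else 0.0)
        self.weights = [w - self.learning_rate * error * x
                        for w, x in zip(self.weights, features)]
        self.bias -= self.learning_rate * error


model = OnlineFraudModel(n_features=2)
suspicious = [1.0, 1.0]  # hypothetical, already-normalised feature vector
before = model.predict_proba(suspicious)
for _ in range(20):
    # A reviewer confirms that this kind of transaction is fraudulent.
    model.record_feedback(suspicious, is_fraud=True)
after = model.predict_proba(suspicious)
print(before, after)
```

The untrained model starts at a probability of 0.5, and each confirmed fraud nudges the score for similar transactions upwards.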

Possible Mentor(s) @Eileenmcnaughton , @awight
Have you contacted your mentors already? I worked with Eileen for about a year back in 2016, which included a GSoC project for CiviCRM, and I have discussed the proposal with Adam.

Timeline

I’m going to divide the work into 2 major phases:

Experimentation phase (May - mid June)
The experimentation phase will mainly consist of trying out the proposed techniques on the current dataset and comparing how they perform against each other and against the current fraud detection system. Tentative tasks include:

  • (Pre-Week 1 - Week 1) Dataset procurement and cleaning
  • (Week 1-2) Reading up and applying feature selection to the dataset
  • (Week 2-5) Reading up and applying anomaly detection techniques; comparing precision and recall scores; deciding on the best technique for the web service
  • Checkpoint 1: Experimentation phase finished; we should know which technique works best. Deliverables: Theoretical knowledge of how the fraud detection system should be modeled, along with a proof of concept.

Architectural phase (June - August)
The architectural phase involves integrating the best-performing technique with a web service. Tentative tasks include:

  • (Week 6) API design for the web service
  • (Week 6-7) Setting up the bare-bones architecture for the web service
  • Checkpoint 2: Midway through the architectural phase; we have the model and we have the web service separately. Deliverables: Proof of concept and a well-thought-out design document for the web service.
  • (Week 7-8) Implement the API (or at least the important parts of it)
  • Checkpoint 3: Almost done with the architectural phase; we need to decide how to fit the web service into the WMF transaction workflow. Deliverables: The completed fraud detection web service.
  • (Week 9-10) Integrate the API into the WMF transaction workflow
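
A minimal sketch of what a scoring endpoint's request handling might look like is given below. It is framework-agnostic, so a web framework route could simply wrap it. The field names (`transaction_id`, `amount`, `country`), the response shape, and the stub scorer are all hypothetical; the actual API is to be settled in the design document.

```python
import json

# Hypothetical request fields; the real schema is not defined in this proposal.
REQUIRED_FIELDS = ("transaction_id", "amount", "country")


def handle_score_request(body, score_fn, threshold=0.5):
    """Validate a JSON request body and return (status_code, JSON response)."""
    try:
        payload = json.loads(body)
    except json.JSONDecodeError:
        return 400, json.dumps({"error": "body is not valid JSON"})
    if not isinstance(payload, dict):
        return 400, json.dumps({"error": "body must be a JSON object"})
    missing = [field for field in REQUIRED_FIELDS if field not in payload]
    if missing:
        return 400, json.dumps({"error": "missing fields", "fields": missing})
    score = score_fn(payload)
    return 200, json.dumps({
        "transaction_id": payload["transaction_id"],
        "fraud_score": score,
        "is_fraud": score >= threshold,
    })


def stub_scorer(transaction):
    """Placeholder standing in for the trained model; not a real fraud rule."""
    return 0.9 if transaction["amount"] > 1000 else 0.1


request_body = '{"transaction_id": "t1", "amount": 5000, "country": "IN"}'
status, response = handle_score_request(request_body, stub_scorer)
print(status, response)
```

Keeping the model behind a `score_fn` argument means the web service and the model can be developed separately and joined later, which matches the split between checkpoints 2 and 3.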

Participation

  • Communication: I usually try to be available on IRC during work hours and by email during the rest of my waking hours.
  • Source code: The source code belongs in an independent repository, although we might want to combine it with the ORES code later on.
  • Progress reports: Weekly progress reports to the fundraising mail thread or on Wikimedia-Fundraising.

About Me

I'm currently a final-year B.Tech. student in Computer Science & Engineering at IIT Guwahati, India. I started contributing to CiviCRM in 2015 and ended up doing a GSoC project with Eileen in 2016; she was the one who introduced me to the folks here at WMF. This project is going to be priority number one during my summer break, as I don't have any other pressing commitments during that time.

Past Experience

For the past year I've been working on a thesis project on data science and information retrieval which involves machine learning techniques similar to the ones I want to use here. In addition to that I have considerable experience working with open source organizations - I was an active contributor to CiviCRM and a GSoC participant back in 2016.

Also, I'm comfortable adapting to new tech stacks and getting "code-ready" in a short period of time thanks to my internship at Google in 2017.

Other Info

Machine Learning Techniques for Anomaly Detection
  1. Autoencoders: Autoencoders are neural nets that try to learn the underlying patterns in data in an unsupervised way. Outliers to these patterns are detected as anomalies.
  2. Logistic Regression: Logistic regression tries to find the best (yet reasonable) fitting model to describe the relationship between a dependent variable (fraud/not fraud) and a set of independent variables (features). Transactions whose predicted probability of fraud exceeds a chosen threshold are flagged.
  3. Supervised Learning using Classifiers: The problem with supervised learning is class imbalance: if, for example, an SVM guessed that transactions were never fraudulent, it would have been correct ~99.6% of the time on WMF’s transactions from 2017. A workaround is to under-sample normal transactions so that frauds are not overwhelmingly outnumbered by normal transactions. An ensemble of classifiers (something which combines the outputs of multiple classifiers and then classifies the transaction as fraud/not fraud) should work even better than a single classifier.
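
The imbalance problem, the under-sampling workaround, and the ensemble idea from the list above can be sketched in a few lines. The data is synthetic (4 frauds in 1000 transactions, chosen to mirror the ~99.6% figure), and the helper names and the 1:1 sampling ratio are assumptions for illustration only.

```python
import random


def undersample(transactions, labels, ratio=1.0, seed=0):
    """Keep every fraud plus a random subset of about `ratio` normals per fraud."""
    rng = random.Random(seed)
    frauds = [(x, y) for x, y in zip(transactions, labels) if y == 1]
    normals = [(x, y) for x, y in zip(transactions, labels) if y == 0]
    keep = min(len(normals), int(len(frauds) * ratio))
    sample = frauds + rng.sample(normals, keep)
    rng.shuffle(sample)
    return [x for x, _ in sample], [y for _, y in sample]


def majority_vote(predictions):
    """Combine 0/1 votes from several classifiers; ties count as not fraud."""
    return 1 if sum(predictions) * 2 > len(predictions) else 0


# Synthetic data: 4 frauds among 1000 transactions (0.4% fraud rate).
labels = [1] * 4 + [0] * 996
transactions = list(range(1000))  # stand-in transaction ids

# A classifier that always answers "not fraud" still looks very accurate.
always_normal_accuracy = labels.count(0) / len(labels)

balanced_x, balanced_y = undersample(transactions, labels)
print(always_normal_accuracy, len(balanced_y), sum(balanced_y))
print(majority_vote([1, 1, 0]))
```

After under-sampling, the training set holds as many frauds as normal transactions, so a classifier can no longer score well by ignoring the fraud class. The majority vote here is only the simplest possible ensemble.
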
Additional Links

Event Timeline

@saurabhbatra96 The proposal and timeline look very good, I only have two suggestions:

(Week 1) Dataset procurement and cleaning

I imagine data collection will take longer, unless there's something ready to go already? Even after the initial data is collected, we'll probably come back and iterate later.

Can you identify one or two "short" goals which might give added value even if the full project can't be completed in our timeframe?

I imagine data collection will take longer, unless there's something ready to go already? Even after the initial data is collected, we'll probably come back and iterate later.

So GSoC has a community bonding period from April 23 to May 14; I was counting week 1 as the first week of coding, i.e. the week starting on May 15, so we should have about 3-4 weeks to collect and sanitize the data. Still, I should probably mention this in the proposal, as it's not very clear from the timeline.

Can you identify one or two "short" goals which might give added value even if the full project can't be completed in our timeframe?

On it.

@awight I've added some checkpoints to help us evaluate exactly what we should aim to achieve in order to deem the project "complete", if not a "success", should we run into hurdles that are tough to cross.

@saurabhbatra96 I was asked on the CiviCRM channel if machine learning could also be used to predict donor behaviour (e.g. likelihood of donating again or increasing amount) - I'm not expecting this to be in scope, but I'm mentioning it to see if the possibility of meeting this 'smaller size org' feature request affects the design.

@Eileenmcnaughton I think that is quite possible within the current design, because all we have is a machine learning model exposed externally through an API. It makes no difference to the architecture whether the model predicts donor behavior or frauds. The extra work, of course, is in implementing a model to predict donor behavior, which is currently out of scope but which anyone familiar with machine learning could implement.

Meeting 1 discussion points -

  • Find out the original dataset for https://www.kaggle.com/mlg-ulb/creditcardfraud
  • Postpone feature selection to the integration phase.
  • Ideas about semi-supervised models that can predict fraud/not fraud/maybe fraud. Pass the "maybe fraud" transactions to MinFraud.
  • Idea about having separate models filtered on the basis of location (IP address filtering) with varying degrees of threshold scores.
  • This week's work - experimentation with classical classifiers (SVM, Log. regression etc.)
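
The three-way outcome and the location-specific thresholds discussed above could be combined roughly as follows. This is only a sketch: the threshold values, the region code "XX", and the function name are invented for illustration, and the hand-off to MinFraud is left as a returned label rather than an actual call.

```python
# Hypothetical (low, high) score cut-offs; real values would be tuned on data.
DEFAULT_THRESHOLDS = (0.2, 0.8)
# Hypothetical per-region overrides, e.g. a stricter setting for region "XX".
REGION_THRESHOLDS = {"XX": (0.1, 0.6)}


def triage(score, region=None):
    """Map a model score in [0, 1] to 'fraud', 'not fraud', or 'maybe fraud'."""
    low, high = REGION_THRESHOLDS.get(region, DEFAULT_THRESHOLDS)
    if score >= high:
        return "fraud"
    if score <= low:
        return "not fraud"
    # Uncertain cases are the ones that would be passed on for a second opinion.
    return "maybe fraud"


for score, region in [(0.9, None), (0.05, None), (0.5, None), (0.7, "XX")]:
    print(score, region, triage(score, region))
```

With these example numbers a score of 0.7 is only "maybe fraud" under the default thresholds but counts as "fraud" for the stricter region.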
saurabhbatra96 renamed this task from [GSoC 2018] Machine Learning for Fraud Detection to Machine Learning for Fraud Detection. Jun 11 2018, 2:49 PM
saurabhbatra96 reopened this task as Open.
saurabhbatra96 removed a subscriber: srishakatux.
saurabhbatra96 added a subscriber: srishakatux.

Hi @srishakatux, I'm reopening this because the project is still underway and we're tracking it here; I removed the GSoC tag though.

Updates -

WIP -

  • API Design + Flask code
  • WMF data access

Pending -

  • Feature selection and normalisation for WMF data
  • ML model + API integration
  • API + WMF transaction flow integration

Yay!

Since we've successfully reached checkpoint 3, I'm closing this. I propose we start a new issue for integration with Donor Services.