Credit Card Fraud Detection

Source: Deep Learning on Medium

Abstract

In data mining, anomaly detection means searching for a data point, item, or record that does not conform to an expected pattern or trend, or to the other data points in a dataset. Most of the time, such data points are treated as defects, outliers, errors, or frauds. Various machine learning algorithms speed up the detection of these outliers. They are used for intrusion detection as well as outlier detection, and can help prevent attacks, defects, faults, and so on. Many companies, organizations, and institutions have adopted these algorithms as a simple yet effective approach to detecting and classifying anomalies. Machine learning algorithms in general learn from data and make predictions based on it, whereas anomaly detection algorithms focus specifically on the outliers. They provide a way to detect and classify anomalies starting from an initially large set of features. Anomaly detection, or outlier detection, is the recognition of data, records, or observations that raise suspicion by differing significantly from the majority of the data.

Introduction

Credit card fraud is a serious, global crime committed by fraudsters using a payment card such as a credit card or debit card. Their purpose is to acquire goods without paying, or to obtain unauthorized funds from an account. Credit card fraud also gives rise to identity theft. According to some reports and statistics, the rate of identity theft increased by 21 percent in 2008. However, credit card fraud, the crime that leads to ID theft, has been decreasing as a percentage of all ID theft complaints for many years. Only 0.1% of card holders are aware of credit card fraud. These frauds have resulted in huge financial losses because the fraudulent transactions have been large-value transactions. In 1999, 10 million transactions out of 12 billion turned out to be fraudulent. Also, 4 out of every 10,000 active accounts are fraudulent. Current fraud detection systems are only able to prevent 1/12th of 1% of all transactions processed, which still leads to billions of dollars in losses.

Methodology

The main aim of this project is to find the best-suited algorithm for detecting outliers, or frauds, in credit card transactions. We implement several machine learning and deep learning algorithms, compare them, and choose the best one.

We will implement algorithms like:

• Neural Network

• Isolation Forest

• OneClassSVM

• Local Outlier Factor

For this purpose, we used an existing dataset containing information about transactions made by various cardholders. The dataset consists of around 300,000 records, of which only around 500 are frauds. The dataset is therefore highly imbalanced: the positive class (frauds) accounts for only 0.172% of all transactions. All feature columns are numeric. Columns V1, V2, V3… V28 are the result of a PCA transformation; their values are described as ranging from -1 to 1, though a PCA transformation by itself does not bound values to that range. Columns such as Time and Amount have not been transformed. The Class column is the classification variable, which contains the value 0 (normal case) or 1 (fraud).
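
The imbalance can be checked with a few lines of arithmetic. This sketch uses the class counts listed in the Visualization step of this article (284315 normal, 492 fraud); the variable names are only illustrative.

```python
# Class counts as listed in the visualization step of this article.
normal_count = 284315
fraud_count = 492

total = normal_count + fraud_count
fraud_share = fraud_count / total

# A model that labels every transaction as normal is already this accurate,
# which is why overall accuracy alone says little on such data.
always_normal_accuracy = normal_count / total

print(f"total transactions: {total}")
print(f"fraud share: {fraud_share:.4%}")
print(f"accuracy of always predicting normal: {always_normal_accuracy:.4%}")
```

This is why the results further down report accuracy on normal cases and on outliers separately.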

For Dataset click here

Implementation

1. Artificial Neural Network:

An ANN (artificial neural network) is a deep learning concept, implemented here using Keras. ANNs are composed of neurons. The first layer, or input layer, consists of the input neurons, which receive the transaction features and amount for each customer. The hidden layers consist of weights, biases, and an activation function. We can add as many hidden layers as needed to tune performance; in this case we use 3 layers. The output layer is the final layer, where we get the classified output. The output is either 1 or 0, where 1 indicates a fraud case and 0 indicates a normal one.
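
To make the roles of weights, biases, and activation functions concrete, here is a minimal forward pass in plain Python. It is not the Keras model from this project: the layer sizes (16 and 8), the ReLU and sigmoid activations, the random untrained weights, and the 29 input features (V1 to V28 plus the scaled Amount, with Time dropped) are all assumptions made for illustration.

```python
import math
import random

def relu(x):
    return max(0.0, x)

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def dense(inputs, weights, biases, activation):
    """One fully connected layer: activation(w . x + b) for each neuron."""
    return [activation(sum(w * x for w, x in zip(row, inputs)) + b)
            for row, b in zip(weights, biases)]

def random_layer(n_in, n_out, rng):
    """Untrained layer: small random weights and zero biases."""
    weights = [[rng.uniform(-0.5, 0.5) for _ in range(n_in)] for _ in range(n_out)]
    biases = [0.0] * n_out
    return weights, biases

rng = random.Random(0)
n_features = 29  # assumption: V1..V28 plus scaled Amount

# Three layers of weights: two hidden layers and a single-neuron output layer.
layers = [
    (*random_layer(n_features, 16, rng), relu),
    (*random_layer(16, 8, rng), relu),
    (*random_layer(8, 1, rng), sigmoid),
]

def predict_proba(features):
    """Pass one transaction through every layer; returns a value in (0, 1)."""
    out = features
    for weights, biases, activation in layers:
        out = dense(out, weights, biases, activation)
    return out[0]

transaction = [rng.gauss(0, 1) for _ in range(n_features)]  # made-up input
p = predict_proba(transaction)
label = 1 if p >= 0.5 else 0  # 1 = fraud, 0 = normal
print(p, label)
```

With random weights the output carries no meaning; training is what adjusts the weights and biases so that the sigmoid output approaches 1 for fraud cases.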

1) Data Processing:

• Libraries Used for Data Preprocessing: Pandas, NumPy

• Operations applied: Feature Scaling, PCA

• Columns Dropped: Time
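
Feature scaling can take several forms, and this article does not say which one was used. As a minimal sketch, assuming standardization (zero mean, unit variance) of the Amount column, the operation looks like this. The amounts are made up and the function name is hypothetical.

```python
import statistics

def standardize(values):
    """Scale values to zero mean and unit variance."""
    mean = statistics.mean(values)
    std = statistics.pstdev(values)  # population standard deviation
    return [(v - mean) / std for v in values]

# Made-up transaction amounts, for illustration only.
amounts = [2.69, 149.62, 378.66, 123.50, 69.99]
normalized_amounts = standardize(amounts)
print(normalized_amounts)
```

Scaling Amount this way puts it on a scale comparable to the other numeric features before they are fed to the network.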

2) Visualization:

• Libraries Used for Visualization: Matplotlib

• Histogram plot of the Class column

Class 0: 284315 Class 1: 492

• Countplot of the Class column

• Catplot of the Class and normalized Amount columns

• Scatterplot of the Class column

3) Model (Neural Network):

• Framework Used for Neural Network: Keras

4) Model Evaluation:

• Tools used for model evaluation: confusion matrix, classification report

• Accuracy Score:

99% accuracy in predicting the cases, with a cost of 0.4%

5) Final Report:

• No. of lines of code: 80

• Source Code Memory: 108 kb

• Time Taken: 6:16:46 (about 6 minutes 16 seconds; may vary between systems)

• Accuracy in Predicting Normal Cases: 99%

• Accuracy in Predicting Outliers: 76%
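
The article does not spell out how the two accuracy figures were computed. One plausible reading, assumed here, is per-class accuracy taken from the confusion matrix: the share of normal cases predicted as normal, and the share of frauds predicted as fraud. The labels below are made up and the function name is hypothetical.

```python
def confusion_counts(y_true, y_pred):
    """Return (tn, fp, fn, tp) for binary labels where 1 means fraud."""
    tn = fp = fn = tp = 0
    for truth, pred in zip(y_true, y_pred):
        if truth == 1 and pred == 1:
            tp += 1
        elif truth == 1:
            fn += 1
        elif pred == 1:
            fp += 1
        else:
            tn += 1
    return tn, fp, fn, tp

# Made-up labels, only to exercise the function.
y_true = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]
y_pred = [0, 0, 0, 0, 0, 0, 0, 1, 1, 0]

tn, fp, fn, tp = confusion_counts(y_true, y_pred)
overall_accuracy = (tn + tp) / len(y_true)
normal_accuracy = tn / (tn + fp)   # share of normal cases predicted as normal
outlier_accuracy = tp / (tp + fn)  # share of frauds predicted as fraud
print(overall_accuracy, normal_accuracy, outlier_accuracy)
```

On imbalanced data the overall figure is dominated by the normal class, so the per-class numbers are the more informative ones.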

6) Source Code:

• For ipynb file, click here

• Place it in your Anaconda home directory and run the file using Jupyter Notebook.

2. Anomaly Detection Algorithms

Anomaly detection algorithms are used to identify unusual patterns that differ from the expected trend and behavior. These cases are called outliers. They are used across many fields, from intrusion detection (identifying strange patterns in network traffic that could signal a hack) and system health monitoring, to detecting a malignant tumor in an MRI scan, fraud detection in credit card transactions, and fault detection in operating environments.

1) Data Processing:

• Libraries used for data processing and manipulation: Pandas, NumPy

• We divide the dataset into 3 parts: Outliers, Normal Train, Normal Test
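
A minimal sketch of such a three-way split follows. The function name, the 25% test fraction, and the toy rows are assumptions, since the article does not give the proportions it used.

```python
import random

def split_for_anomaly_detection(rows, labels, test_fraction=0.2, seed=0):
    """Split into (normal_train, normal_test, outliers) using 0/1 labels."""
    normal = [row for row, y in zip(rows, labels) if y == 0]
    outliers = [row for row, y in zip(rows, labels) if y == 1]
    rng = random.Random(seed)
    rng.shuffle(normal)  # shuffle so the test slice is a random sample
    n_test = int(len(normal) * test_fraction)
    return normal[n_test:], normal[:n_test], outliers

# Toy data: 8 normal rows and 2 fraud rows.
rows = [[float(i)] for i in range(10)]
labels = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]

normal_train, normal_test, outliers = split_for_anomaly_detection(
    rows, labels, test_fraction=0.25
)
print(len(normal_train), len(normal_test), len(outliers))
```

The models are then fitted on the normal training rows only, and evaluated separately on the held-out normal rows and on the outliers.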

2) Visualization:

• Library used: Matplotlib

3) Model Implementation:

a) Isolation Forest: Isolation Forest is a tree-based algorithm that detects anomalies by partitioning the data randomly, rather than on the basis of information gain as a decision tree does. Partitions are created by randomly selecting a feature and then randomly choosing a split value between the maximum and minimum values of that feature. We keep creating partitions until every point is isolated. Anomalies are easier to isolate, so they need fewer partitions. In most cases we also set a limit on the number of partitions, that is, the height of the tree.
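
The random partitioning described above can be sketched in plain Python. This is a simplified illustration, not the sklearn implementation used in this project: it follows a single point through repeated random splits and counts how many are needed to isolate it, without building full trees or subsampling. The function names and the toy data are hypothetical.

```python
import random

def isolation_depth(point, data, rng, max_depth=50):
    """Count random splits until `point` is alone in its partition."""
    current = data
    depth = 0
    while len(current) > 1 and depth < max_depth:
        feature = rng.randrange(len(point))    # pick a random feature
        values = [row[feature] for row in current]
        low, high = min(values), max(values)
        if low == high:
            break                              # cannot split on this feature; stop
        split = rng.uniform(low, high)         # random split between min and max
        # Keep only the side of the split that contains `point`.
        if point[feature] < split:
            current = [row for row in current if row[feature] < split]
        else:
            current = [row for row in current if row[feature] >= split]
        depth += 1
    return depth

def average_depth(point, data, n_trees=100, seed=0):
    """Average the isolation depth over many random partitionings."""
    rng = random.Random(seed)
    return sum(isolation_depth(point, data, rng) for _ in range(n_trees)) / n_trees

rng = random.Random(42)
inlier = (0.0, 0.0)
outlier = (8.0, 8.0)
data = [(rng.gauss(0, 1), rng.gauss(0, 1)) for _ in range(200)] + [inlier, outlier]

inlier_depth = average_depth(inlier, data)
outlier_depth = average_depth(outlier, data)
print(inlier_depth, outlier_depth)
```

The point far from the cloud is isolated after far fewer splits on average than the point in the middle, and that difference in depth is what the algorithm turns into an anomaly score.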

• Libraries used for model building: IsolationForest from sklearn.ensemble

• Hyperparameters used and their values:

Evaluation and Conclusion:

• Accuracy on predicting Test cases: 85%

• Accuracy on predicting Outliers: 90%

• Time Taken for compilation: 1 Min

• Source Code Memory: 70 kb

b) Local Outlier Factor: This is an unsupervised anomaly detection algorithm that computes the local density deviation of a given data point with respect to its neighbors. It flags as outliers the data points that have a substantially lower density than their neighbors. When LOF is used for outlier detection, it has no predict, decision_function, or score_samples methods. The number of neighbors considered (parameter n_neighbors) is typically set 1) greater than the minimum number of samples a cluster has to contain, so that other samples can be local outliers relative to this cluster, and 2) smaller than the maximum number of nearby samples that can potentially be local outliers.
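
The local density comparison can be written out in plain Python. This is a simplified sketch of the LOF computation, not the sklearn implementation used in this project: it takes exactly k neighbors per point and ignores distance ties. The function name, k=5, and the toy grid are assumptions.

```python
import math

def local_outlier_factors(points, k):
    """LOF score per point; values well above 1 suggest an outlier."""
    n = len(points)
    dist = [[math.dist(p, q) for q in points] for p in points]

    neighbors, k_distance = [], []
    for i in range(n):
        order = sorted((j for j in range(n) if j != i), key=lambda j: dist[i][j])
        neighbors.append(order[:k])               # k nearest neighbors of i
        k_distance.append(dist[i][order[k - 1]])  # distance to the k-th neighbor

    # Local reachability density: inverse of the mean reachability distance.
    lrd = []
    for i in range(n):
        reach = [max(k_distance[j], dist[i][j]) for j in neighbors[i]]
        lrd.append(len(reach) / sum(reach))

    # LOF: average density of the neighbors relative to the point's own density.
    return [sum(lrd[j] for j in neighbors[i]) / (k * lrd[i]) for i in range(n)]

# A dense 5x5 grid plus one isolated point (the last entry).
grid = [(x, y) for x in range(5) for y in range(5)]
points = grid + [(10, 10)]
scores = local_outlier_factors(points, k=5)
print(max(scores[:-1]), scores[-1])
```

Grid points score close to 1 because their density matches that of their neighbors, while the isolated point scores much higher because its neighbors sit in a far denser region.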

• Libraries Used for Implementing Algorithm: LocalOutlierFactor from sklearn.neighbors

• Hyperparameters and values used:

Evaluation and Conclusion:

• Accuracy on predicting the cases: 89%

• Time Taken for compilation: 35 Min

• Source Code Memory: 70 kb

c) OneClassSVM: A One-Class Support Vector Machine is an unsupervised learning algorithm that is trained on only one class. It is particularly useful in scenarios where you have a lot of “normal” data and not many cases of the anomalies you are trying to detect. An SVM model is based on dividing the training sample points into separate categories by as wide a gap as possible, while penalizing training samples that fall on the wrong side of the gap. The SVM model then makes predictions by assigning points to one side of the gap or the other. In one-class learning, we train the model only on the data of a single class and judge new points against it.

• Libraries Used for Implementing Algorithm: OneClassSVM from sklearn.svm

• Hyperparameters and values used:

Evaluation and Conclusion:

• Accuracy in predicting Test Cases: 45%

• Accuracy in predicting Outliers: 91%

• Time Taken: 30 Min

• Source Code Memory: 70 kb

Source Code:

• For ipynb file, click here

• Place it in your Anaconda home directory and run the file using Jupyter Notebook.

d) DBSCAN: Density-Based Spatial Clustering of Applications with Noise. DBSCAN is a clustering algorithm that can be used for anomaly detection. In this method, we calculate the distance between points (the Euclidean distance or some other distance) and look for points that are far away from the others. DBSCAN works bottom-up, grouping data points that are close to each other. Clusters with few points in them are considered outliers. DBSCAN can detect outliers in time series in a simplified form: we consider each host to be a point in d dimensions, where d is the number of elements in the time series. After the distances between data points are calculated, clusters are formed, and data points that are not in the largest cluster are considered outliers.
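
A compact plain-Python version of the clustering step is sketched below. It is not the sklearn implementation used in this project. It follows the usual DBSCAN convention of labelling points with too few close neighbors as noise (-1), which differs slightly from the "not in the largest cluster" rule described above. The eps and min_samples values and the toy points are assumptions.

```python
import math

NOISE = -1

def dbscan(points, eps, min_samples):
    """Return a cluster label per point; NOISE (-1) marks outliers."""
    labels = [None] * len(points)

    def region(i):
        # Indices of all points within eps of point i (including i itself).
        return [j for j, q in enumerate(points) if math.dist(points[i], q) <= eps]

    cluster = 0
    for i in range(len(points)):
        if labels[i] is not None:
            continue                      # already assigned
        neighbors = region(i)
        if len(neighbors) < min_samples:
            labels[i] = NOISE             # not dense enough to start a cluster
            continue
        labels[i] = cluster
        queue = [j for j in neighbors if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == NOISE:
                labels[j] = cluster       # border point reached from a core point
            if labels[j] is not None:
                continue
            labels[j] = cluster
            j_neighbors = region(j)
            if len(j_neighbors) >= min_samples:
                queue.extend(j_neighbors)  # j is a core point, keep expanding
        cluster += 1
    return labels

# Two dense 3x3 blocks and one isolated point.
cluster_a = [(x, y) for x in range(3) for y in range(3)]
cluster_b = [(x + 10, y + 10) for x in range(3) for y in range(3)]
points = cluster_a + cluster_b + [(5, 20)]

labels = dbscan(points, eps=1.5, min_samples=4)
print(labels)
```

The result depends heavily on eps and min_samples. If eps is large relative to the spread of the data, nearly everything ends up in one cluster and no point is flagged.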

• Libraries Used for Implementing Algorithm: DBSCAN from sklearn.cluster

• Hyperparameters and values used:

Evaluation and Conclusion:

• Accuracy in predicting All cases: 99%

• Accuracy in predicting outliers: 0%

• Source Code Memory: 28 kb

Source Code:

• For ipynb file, click here

• Place it in your Anaconda home directory and run the file using Jupyter Notebook.

Conclusion

Several algorithms were implemented on the same dataset to detect credit card fraud. All of them were analyzed and compared on the basis of their accuracy in predicting normal cases and outliers, or frauds. We implemented different types of algorithms: a neural network from deep learning; anomaly detection algorithms such as Isolation Forest, OneClassSVM, and Local Outlier Factor; and a clustering algorithm, DBSCAN. This was done to find the best approach for the purpose. On analysis, the 3-layer neural network and DBSCAN are spot on in predicting normal cases, with 99% accuracy, but at predicting outliers they are not as good as the anomaly detection algorithms (76% for the neural network and 0% for DBSCAN). Isolation Forest and OneClassSVM give impressive accuracies of 90% and 91% respectively on outliers, but they are less accurate than the neural network on normal cases. In terms of time taken, the neural network and Isolation Forest are very impressive. In future, these algorithms can be applied to other cases. For better results, we can tune the properties of the neural network's layers.