Credit Card Transaction Fraud Detection with Deeplearning4j


In this post, we will build an autoencoder in Deeplearning4j to detect fraudulent credit card transactions. We will also learn how to deal with an unbalanced dataset, which is very common in anomaly detection applications.

Background

An anomaly is, broadly, something that deviates from what is standard or expected. Identifying an anomalous event at an early stage is often challenging, because the conditions for recognizing one are vague or unclear.

Hence, anomaly detection with deep learning normally starts by spotting samples with unusual trends and patterns, flagging them as candidates of interest, and then inspecting them further at later stages to verify whether they are genuine anomalies.

Another distinctive feature of anomaly detection problems is that the dataset is inherently unbalanced: the proportion of positively labelled samples is far smaller than that of negatively labelled samples. This is common in real-world problems, where anomalous events happen far less frequently than normal ones.

When selecting a neural network architecture, the highly unbalanced dataset needs to be factored in. A vanilla multi-layer perceptron classifier would not be appropriate, as the model would be heavily biased towards the majority class.

Autoencoders are commonly used with unbalanced datasets and are well suited to anomaly detection tasks such as fraud detection, industrial damage detection, and network intrusion detection. This will be elaborated in the next section.

Introduction to Autoencoder

An autoencoder learns, in an unsupervised manner, a general representation of the dataset. As seen in Figure 1 below, the network has two important modules: an encoder and a decoder. The encoder performs dimensionality reduction by learning the underlying patterns of the dataset in a compressed form. It is followed by a decoder, which does the reverse and produces an uncompressed version of the data.

Figure 1. Illustration of an autoencoder.

Essentially, the bottleneck layer sandwiched between the encoder and decoder is the key attribute of the design. It restricts the amount of information traversing from encoder to decoder, forcing the network to learn a low-dimensional representation of the input data. With the output layer having the same number of units as the input layer, the network learns a compressed and efficient representation of the input by minimizing the reconstruction error.
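To make the objective concrete, the reconstruction error minimized here is the mean squared error between an input x and its reconstruction x̂ (this is also the loss used later in the implementation), which in LaTeX notation reads:

\mathcal{L}(x, \hat{x}) = \frac{1}{n} \sum_{i=1}^{n} (x_i - \hat{x}_i)^2

where n is the number of input features (29 in this example).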

Implementation

Problem Statement

In this example, we illustrate anomaly detection with the use case of credit card fraud detection. The dataset can be retrieved from Kaggle: https://www.kaggle.com/mlg-ulb/creditcardfraud

From the original dataset, the non-fraud data is sampled down to 108,000 data points and the fraud data to 490 data points. The resulting dataset is highly unbalanced, with only 490 frauds (about 0.451% of all transactions). The distribution of the data is shown in the figure below, where you can see that the amount of fraud data is heavily disproportionate to the normal data.

Figure 2. Distribution of normal to fraud data.

The highly unbalanced dataset makes it well suited to an autoencoder, which learns a general representation of the data and flags data points with uncommon patterns. An LSTM autoencoder is used in this example to capture the time series nature of the transaction data. If you want to learn more about LSTMs, here’s a really good link.

Code

The program can be found in the GitHub repository.

The README in the subfolder shows how to run the program. Note: the time needed to run it varies depending on the computation backend. Choose the CUDA backend with cuDNN for faster computation.

Snippets of code are shown below for explanation purposes.

Data Preprocessing

In the original dataset, most transaction features are principal components obtained via Principal Component Analysis (PCA). Two further features, ‘Time’ and ‘Amount’, have not been transformed. The last column of the dataset denotes the ground truth of ‘fraud’ or ‘non-fraud’.

To proceed with our analysis, the ‘Time’ column is removed because the feature is not needed, and the ‘Amount’ column is normalized to the range 0 to 1. Each sample data point is then written to its own file, stored in either the fraud or non-fraud directory depending on its ground truth label. The label of the sample is stored in a separate label directory under the same file name.
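This preprocessing step can be sketched roughly as follows. The class name, file paths, and directory names here are illustrative assumptions (the repository's own preprocessing code may differ); only the column indices of the Kaggle file and the transformations described above are taken from the text.

import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

public class PreprocessSketch {
    public static void main(String[] args) throws IOException {
        // Path to the raw Kaggle file; adjust to your local copy
        List<String> rows = Files.readAllLines(Paths.get("creditcard.csv"));
        rows = rows.subList(1, rows.size());               // drop the header row

        // First pass: find min/max of 'Amount' (column index 29) for min-max normalization
        double minAmt = Double.MAX_VALUE, maxAmt = -Double.MAX_VALUE;
        for (String row : rows) {
            double amt = Double.parseDouble(row.split(",")[29]);
            minAmt = Math.min(minAmt, amt);
            maxAmt = Math.max(maxAmt, amt);
        }

        // Second pass: drop 'Time', normalize 'Amount', write one file per transaction
        int fraudCount = 0, nonFraudCount = 0;
        for (String row : rows) {
            String[] cols = row.split(",");
            List<String> features = new ArrayList<>(Arrays.asList(cols).subList(1, 29)); // V1..V28
            double amt = Double.parseDouble(cols[29]);
            features.add(String.valueOf((amt - minAmt) / (maxAmt - minAmt)));            // scaled 'Amount'

            String label = cols[30].replace("\"", "");     // 'Class': 0 = non-fraud, 1 = fraud
            boolean isFraud = label.equals("1");
            String dir = isFraud ? "fraud" : "non_fraud";
            int idx = isFraud ? fraudCount++ : nonFraudCount++;

            // Feature file and label file share the same numbered name
            Files.createDirectories(Paths.get(dir, "feature"));
            Files.createDirectories(Paths.get(dir, "label"));
            Files.write(Paths.get(dir, "feature", idx + ".csv"),
                    Collections.singletonList(String.join(",", features)));
            Files.write(Paths.get(dir, "label", idx + ".csv"),
                    Collections.singletonList(label));
        }
    }
}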

For training, we set aside a large portion of the non-fraud data (100,000 samples) to be modelled by the autoencoder. The remaining non-fraud data (8,000 samples) and the fraud data (490 samples) are used during the testing and validation phases.

As a result, we have four data directories: train_data_feature, train_data_label, test_data_feature, and test_data_label. For example, 0.csv in the directory test_data_feature stores one credit card transaction, while 0.csv in the directory test_data_label stores the label (fraud/non-fraud) of that transaction.

Fig 3. File directory paths for training and testing dataset

Note: Autoencoders are commonly used for unsupervised learning, where unlabelled data is provided for training. In this example, the autoencoder is used in a semi-supervised fashion because labels are present. We train the autoencoder only on the “non-fraud” data, on the assumption that this leads to a better generalization of the representation. During testing, data with a high reconstruction error is categorized as a high fraud risk.

Data Vectorization

The data stored in CSV files is read in using CSVSequenceRecordReader and SequenceRecordReaderDataSetIterator, as shown in Figure 4. The training dataset is grouped into minibatches of 284 samples, while each testing DataSet contains a single data point to suit the evaluation later on. Note that the training data in this example contains only non-fraud data.
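A sketch of this vectorization step is shown below. The minibatch size of 284 and the single-sample test iterator follow the text; the class name and the file index ranges (0–99,999 for training, 0–8,489 for testing, derived from the sample counts above) are illustrative assumptions.

import org.datavec.api.records.reader.SequenceRecordReader;
import org.datavec.api.records.reader.impl.csv.CSVSequenceRecordReader;
import org.datavec.api.split.NumberedFileInputSplit;
import org.deeplearning4j.datasets.datavec.SequenceRecordReaderDataSetIterator;
import org.nd4j.linalg.dataset.api.iterator.DataSetIterator;

public class VectorizationSketch {
    public static void main(String[] args) throws Exception {
        // Training features and labels live in parallel numbered files: 0.csv, 1.csv, ...
        SequenceRecordReader trainFeatures = new CSVSequenceRecordReader();
        trainFeatures.initialize(new NumberedFileInputSplit("train_data_feature/%d.csv", 0, 99999));
        SequenceRecordReader trainLabels = new CSVSequenceRecordReader();
        trainLabels.initialize(new NumberedFileInputSplit("train_data_label/%d.csv", 0, 99999));

        // Minibatches of 284 sequences; 2 label classes (fraud / non-fraud)
        DataSetIterator trainIter = new SequenceRecordReaderDataSetIterator(
                trainFeatures, trainLabels, 284, 2, false);

        // Test iterator returns one transaction per DataSet so each sample can be scored individually
        SequenceRecordReader testFeatures = new CSVSequenceRecordReader();
        testFeatures.initialize(new NumberedFileInputSplit("test_data_feature/%d.csv", 0, 8489));
        SequenceRecordReader testLabels = new CSVSequenceRecordReader();
        testLabels.initialize(new NumberedFileInputSplit("test_data_label/%d.csv", 0, 8489));
        DataSetIterator testIter = new SequenceRecordReaderDataSetIterator(
                testFeatures, testLabels, 1, 2, false);
    }
}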

Fig 4. Vectorization of data

Network Architecture

Next, the LSTM autoencoder is formed as shown in Fig 5. Each layer is constructed as an LSTM layer to learn the correlations in the time series data. Note that the Mean Squared Error loss is used in the outermost layer, since the autoencoder is built around reconstructing the input data and quantifying the reconstruction error.

Fig 5. LSTM Autoencoder Architecture

The network starts with an LSTM layer of 29 units, matching the feature length of the input data. In the encoder, the number of nodes is then reduced to 16 and eventually 8 in the subsequent layers; this is the bottleneck of the autoencoder, which captures the essential representation of the data. The decoder mirrors the encoder: it is constructed with 8 nodes followed by 16 nodes, in an attempt to reconstruct the input data. The network is completed with an output layer of 29 units, identical to the number of nodes in the input layer. This information is summarized in Fig 6.
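A configuration along these lines can be sketched as follows. The 29-16-8-16-29 layer sizes and the MSE output loss follow the description above; the class name, seed, updater, and activation functions are illustrative assumptions rather than the repository's exact settings.

import org.deeplearning4j.nn.conf.MultiLayerConfiguration;
import org.deeplearning4j.nn.conf.NeuralNetConfiguration;
import org.deeplearning4j.nn.conf.layers.LSTM;
import org.deeplearning4j.nn.conf.layers.RnnOutputLayer;
import org.deeplearning4j.nn.multilayer.MultiLayerNetwork;
import org.nd4j.linalg.activations.Activation;
import org.nd4j.linalg.learning.config.Adam;
import org.nd4j.linalg.lossfunctions.LossFunctions;

public class AutoencoderConfigSketch {
    public static MultiLayerNetwork buildAutoencoder() {
        MultiLayerConfiguration conf = new NeuralNetConfiguration.Builder()
                .seed(123)
                .updater(new Adam(1e-3))   // illustrative optimizer and learning rate
                .list()
                // Encoder: compress the 29 input features to 16, then to the 8-unit bottleneck
                .layer(0, new LSTM.Builder().nIn(29).nOut(16).activation(Activation.TANH).build())
                .layer(1, new LSTM.Builder().nIn(16).nOut(8).activation(Activation.TANH).build())
                // Decoder: mirror of the encoder, expanding back towards the input dimensionality
                .layer(2, new LSTM.Builder().nIn(8).nOut(16).activation(Activation.TANH).build())
                // Output layer reconstructs all 29 features; MSE measures the reconstruction error
                .layer(3, new RnnOutputLayer.Builder(LossFunctions.LossFunction.MSE)
                        .activation(Activation.IDENTITY).nIn(16).nOut(29).build())
                .build();
        MultiLayerNetwork net = new MultiLayerNetwork(conf);
        net.init();
        return net;
    }
}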

Fig 6. LSTM Autoencoder Network Configuration

Training

After setting up the configuration, the network is trained for 1 epoch, with minibatches fed in one after another. The features are fed in as both the input and the target output, since an autoencoder reconstructs its input while minimizing the reconstruction error.
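A minimal sketch of this training loop is shown below, assuming net is the autoencoder configured above and trainIter is the training iterator built earlier; the class and method names are illustrative.

import org.deeplearning4j.nn.multilayer.MultiLayerNetwork;
import org.nd4j.linalg.api.ndarray.INDArray;
import org.nd4j.linalg.dataset.DataSet;
import org.nd4j.linalg.dataset.api.iterator.DataSetIterator;

public class TrainingSketch {
    // 'net' is the LSTM autoencoder configured earlier, 'trainIter' the training iterator
    public static void train(MultiLayerNetwork net, DataSetIterator trainIter, int nEpochs) {
        for (int epoch = 0; epoch < nEpochs; epoch++) {      // nEpochs = 1 in this example
            trainIter.reset();
            while (trainIter.hasNext()) {
                DataSet batch = trainIter.next();
                INDArray features = batch.getFeatures();
                // Autoencoder: the input is also used as the reconstruction target
                net.fit(features, features);
            }
        }
    }
}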

Fig 7. Training of the LSTM Autoencoder Network

The DL4J training user interface shows the loss decreasing over time, indicating that the network is converging.

Fig 8. Deeplearning4j Training User Interface.
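The training UI shown above can be attached to the network before training with a few lines such as the sketch below; this is standard Deeplearning4j listener setup, though the exact package names vary slightly between DL4J versions, and the class name here is illustrative.

import org.deeplearning4j.api.storage.StatsStorage;
import org.deeplearning4j.nn.multilayer.MultiLayerNetwork;
import org.deeplearning4j.ui.api.UIServer;
import org.deeplearning4j.ui.stats.StatsListener;
import org.deeplearning4j.ui.storage.InMemoryStatsStorage;

public class TrainingUiSketch {
    public static void attachUi(MultiLayerNetwork net) {
        // Start (or reuse) the UI server, which serves http://localhost:9000 by default
        UIServer uiServer = UIServer.getInstance();
        StatsStorage statsStorage = new InMemoryStatsStorage();
        uiServer.attach(statsStorage);
        // Stream training statistics (score, update ratios, etc.) from the network to the UI
        net.setListeners(new StatsListener(statsStorage));
    }
}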

Evaluation Results

After training, the network is evaluated on the testing dataset. Each test data point is fed into the network, and data points with a high reconstruction error are labelled as predicted frauds. The threshold value set for this example is 2.5.
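One way this thresholding could look is sketched below; the repository's evaluation code may differ. Here net.score(...) returns the loss of a single example, which for this network is the MSE reconstruction error; the threshold of 2.5 is from the text, while the class name, counters, and label indexing are illustrative assumptions.

import org.deeplearning4j.nn.multilayer.MultiLayerNetwork;
import org.nd4j.linalg.api.ndarray.INDArray;
import org.nd4j.linalg.dataset.DataSet;
import org.nd4j.linalg.dataset.api.iterator.DataSetIterator;

public class EvaluationSketch {
    // 'net' is the trained autoencoder, 'testIter' the single-sample test iterator from earlier
    public static void evaluate(MultiLayerNetwork net, DataSetIterator testIter, double threshold) {
        int fraudsCaught = 0, normalsPassed = 0, falseAlarms = 0, missedFrauds = 0;
        testIter.reset();
        while (testIter.hasNext()) {
            DataSet sample = testIter.next();              // one transaction per DataSet
            INDArray features = sample.getFeatures();
            // Reconstruction error: score the example against itself, as during training
            double error = net.score(new DataSet(features, features));
            boolean predictedFraud = error > threshold;    // threshold of 2.5 in this example
            // One-hot label of assumed shape [1, 2, 1]: the entry at class index 1 marks a fraud
            boolean actualFraud = sample.getLabels().getDouble(0, 1, 0) > 0.5;
            if (predictedFraud && actualFraud) fraudsCaught++;
            else if (!predictedFraud && !actualFraud) normalsPassed++;
            else if (predictedFraud) falseAlarms++;
            else missedFrauds++;
        }
        System.out.printf("Frauds caught: %d, normal transactions recognized: %d%n",
                fraudsCaught, normalsPassed);
    }
}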

The network correctly identifies 415 frauds out of a total of 490 cases (about 84.7% of all frauds). It is also able to recognize 27,479 normal transactions out of a total of 28,431 cases.

Fig 9. Evaluation Results.

What’s next

I’ll continue to post articles about modelling neural networks for various use cases, so stay tuned!