Original article was published on Deep Learning on Medium
Deep Autoencoder Network: A Complete Guide for Anomaly Detection in TensorFlow
Imagine a shelf neatly lined with bottles of Irish whiskey, and right in the middle you spot a milk bottle. :D Ahh… that is odd, isn’t it? That oddity is what we refer to as an anomaly.
In real life, we come across all sorts of anomaly scenarios where certain entities deviate from the pattern they are supposed to follow. One significant example, which we are going to analyze today, is fraudulent credit card transactions. There are many more real-life applications of anomaly detection, such as intrusion detection, fraud detection, fault detection, system health monitoring, event detection in sensor networks, and detecting ecosystem disturbances.
The data we analyze contains credit card transaction details of cardholders, including a label indicating whether each transaction is normal or fraudulent, which we can use as the dependent variable for our model. The source of the data is Kaggle. Before moving on, many of you may be wondering: why can’t this be treated as a classification problem? It certainly can, but the main reason we address it as an anomaly detection problem is that almost 82% of the transactions are normal and only 18% deviate from that pattern. Since the distribution among the classes is clearly imbalanced and one class occupies the vast majority of the data, we can treat the minority class as an anomaly, which calls for an anomaly detection approach.
Preprocessing is an important step before feeding the data into any machine learning model. We need to understand the data. How many columns are we dealing with? Do all the columns contain numerical values? Are there any categorical text values that we need to convert into numerical values? Are there any missing values? All these checks must be taken into consideration before we move ahead.
Luckily, the data we have is an absolutely clean set. As you can see, there are no null values at all.
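A sketch of that null check with pandas (the `creditcard.csv` filename and the tiny stand-in frame below are assumptions, since the real file is loaded from Kaggle):

```python
import pandas as pd

# Stand-in for: df = pd.read_csv("creditcard.csv")  — path is an assumption
df = pd.DataFrame({
    "Time":   [0.0, 1.0, 2.0],
    "V1":     [-1.36, 1.19, -1.35],
    "Amount": [149.62, 2.69, 378.66],
    "Class":  [0, 0, 1],
})

# One count of missing values per column; all zeros means a clean dataset
print(df.isnull().sum())
print("Total missing values:", df.isnull().sum().sum())
```

On the real dataset the same two lines confirm there is nothing to impute.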
There are 31 fields, including the ‘Class’ field indicating whether a transaction is Normal or Fraud. You can play around with the data to investigate it further. Let’s see how many Normal and Fraud transactions are present in the set.
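Counting the classes is a one-liner with `value_counts` (the 80/20 toy split below is illustrative, not the real counts):

```python
import pandas as pd

# Hypothetical 'Class' column: 0 = Normal, 1 = Fraud (stand-in for the real data)
df = pd.DataFrame({"Class": [0] * 8 + [1] * 2})

counts = df["Class"].value_counts()
print(counts)                   # number of rows per class
print(counts / len(df) * 100)   # class share in percent
```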
It clearly shows that more than 80% of the data belongs to the Normal class and the remaining roughly 20% is Fraud. Diving deeper into the data, we segregate the Normal and Fraud transactions and examine the Amount field for each class. Looking at those summary statistics alone, you cannot clearly tell whether there is any discrepancy.
Let us visually see how the values of certain fields are distributed. If we consider fields like V4, V7, V9, and V10 we can see the distribution as below.
Now individually if we see the distribution of V4 or V10 field for both Normal and Fraud transactions we can come across the below findings.
The Green distribution belongs to Normal transactions and there one belongs to Fraud. Here we can see the Value of V10 for Normal transaction spreads over -20 to 3 but most of the fraud transactions have a V4 value close to zero. It is an interesting finding. Isn’t it? You can analyze the details for more fields and see the trend.
Let’s now understand what is Autoencoder.
Autoencoder is a neural network architecture which works on unsupervised learning technique to reconstruct the input values. In simple terms, we recreate the input value X given to a neural network with minimal error. The below image explains it better.
In autoencoder, the input data that we give is basically compressed through a bottleneck in the architecture as we impose a lesser number of neurons in the hidden layers. As in the above diagram, the network takes an unlabelled data as input X and learns to output X̂ a reconstruction of original input by framing it as a supervised problem. The main idea of the network is to minimize the reconstruction error L (X, X̂) which is basically the difference between original input to the reconstructed output.
As we limit the number of layers in the hidden layer limited amount of information can flow through the network else if we give the same number of neurons in the hidden layer model will memorize the input data along with the network without learning important attributes about the input. Limiting neurons in the hidden layers will force the network to update the weights by being efficiently penalized according to the reconstruction error. The encoder state (as seen in the above fig.) actually helps the network to understand the underlying pattern of the input data and later on decoder layer learn how to re-create the original data from the condensed details.
A simple question. How a model which is designed to reconstruct the input value can be used as a classifier to detect any anomaly like fraud transaction?
So the answer is, the Reconstruction error helps us to achieve the same. If we train a model to reconstruct non-fraudulent transactions, the network adjusts its weights for non-fraudulent transactions. When data of a fraud transaction been fed to the network then the mean squared error(MSE) of the output will be relatively higher for that input. And if the MSE is higher than a set threshold then we can easily classify that input as an anomaly. I hope that makes sense now.
But wait….!!! What is this threshold now? How do we know a perfect threshold?
Let’s Find out 😊
Data Preparation for Training and Testing:
We have taken simple three steps prior to feeding the data into our model.
1. Standardized the data with min-max scaler
2. Kept 90% of Non-fraudulent transaction for training
3. Combine 10% of Normal transactions and entire fraud transactions to create our Validation and Test data.
After all the requisite pre-processing we finally will create the autoencoder model. We have used Tensorflow 2.0 to create our model. If you are not familiar with Tensorflow architecture I would suggest starting with Tensorflow official offerings [here].
It might look intimidating at first if you are not well versed with TensorFlow but trust me it is not. We have used the Mean Squared Error loss function to calculate the Reconstruction error we have discussed above. To my design decisions, I have used Relu and Sigmoid as the activation functions. You are absolutely free to experiment with the hyperparameter choices for your model.
After all the set up we are now ready to train our model. After 50 epochs we can see the model has an accuracy of 99% with very minimal Loss.
Once our model got trained and saved, we will use our Validation set to validate how well our data is performing.
Initially, we have done the data split and kept validation test data ready. If we inject the data into our trained model, we would be looking at results something like below.
If you remember in our validation dataset, we have both have normal transactions and Fraud transactions. Let’s see how the MSE has been calculated by the model for both the classes of transactions.
Here if you see, for non-fraudulent transactions the average error is 0.000372. 50 percentile value is the median and the value is 0.000235. Subsequently, if we see the upper (75) percentile value is close to 0.0004. Which describes the MSE for the normal transaction is very minimal.
Whereas if we see the details of Fraud transactions, we can clearly mark the error is almost 10 times higher than the normal transactions. Here the lower percentile of error value is 0.003 so we can guess anything above 0.001 can be a Fraudulent transaction as more than 75% values in normal transactions have an MSE less than 0.001.
Mark that deciding a threshold can be a trial and error method. We can always play around with the value. But our initial point, in this case, is 0.001.
A visual interpretation of how reconstruction error value is distributed for Fraud and Normal transactions.
So by exploiting the error details we got a threshold value for MSE which suggests any transaction fed into the network if it gives an error more than a threshold value that will be considered as a fraud transaction.
Let’s now calculate on the same data set how many of the transactions we have successfully categorized correctly.
After comparing the error with the threshold value below is the confusion matrix we get for our results. This says we majority of the Normal transactions and Fraud transactions are correctly classified but still if we want to minimize the numbers for wrongly selected fraud transactions can try to set the threshold accordingly and see how it behaves.
Let us now draw the ROC curve according to the classification we have done earlier.
And if you see as per the ROC curve it has an AUC of 0.9560 which states our model is doing great in classifying the details.
Finding anomaly is one of the interesting tasks for any problem solver and there is an ample number of models being used in industry for the task. Here we have seen how an Autoencoder can also be used as a classifier that can pick any disarrangement in a dataset which is a deviation from usual. I hope you have learned something great today. I have posted the entire code in my Git [link]. Please refer to the same if you need any references.