Original article was published by VICTOR BASU on Deep Learning on Medium
Unsupervised approach for threat detection
In the unsupervised approach, we do not let the model learn from the target variable; instead, we force the algorithm to learn from the input data itself and discover patterns and information on its own.
Pre-processing before training. We removed some features from our data: ‘Flow ID’, ‘ Source IP’, ‘ Source Port’, ‘ Destination IP’, ‘ Destination Port’, ‘ Timestamp’, ‘ Flow Packets/s’, and ‘Flow Bytes/s’. ‘ Flow Packets/s’ and ‘Flow Bytes/s’ were removed because, after standard scaling, these features contained NaN values and values too large for float64.
We scaled our data with standard scaling, followed by normalization. We then used principal component analysis (PCA) for dimensionality reduction, reducing the data to two dimensions.
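The column removal described above can be sketched with pandas. This is a minimal illustration, not the original notebook: the tiny DataFrame, the ‘Flow Duration’ and ‘Label’ columns, and the whitespace stripping of column names are all assumptions made for the example.

```python
import numpy as np
import pandas as pd

# Hypothetical miniature of the flow table; the real data has many more columns.
df = pd.DataFrame({
    "Flow ID": ["a", "b", "c"],
    " Source IP": ["10.0.0.1", "10.0.0.2", "10.0.0.3"],
    " Flow Duration": [120, 0, 45],
    "Flow Bytes/s": [1500.0, np.inf, np.nan],
    " Flow Packets/s": [12.0, np.inf, 3.0],
    " Label": ["BENIGN", "DDoS", "BENIGN"],
})

# Assumption: strip stray whitespace so column names are easier to match.
df.columns = df.columns.str.strip()

dropped = [
    "Flow ID", "Source IP", "Source Port", "Destination IP",
    "Destination Port", "Timestamp", "Flow Packets/s", "Flow Bytes/s",
]

# Count infinite or NaN entries in the two rate columns before removing them.
rate_cols = ["Flow Packets/s", "Flow Bytes/s"]
non_finite = int((~np.isfinite(df[rate_cols].to_numpy())).sum())

# errors="ignore" skips names that are absent from this toy frame.
features = df.drop(columns=dropped, errors="ignore")
```

Counting the non-finite entries first makes it easy to confirm that the rate columns really are the ones that break scaling before discarding them.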
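A minimal sketch of this scale, normalize, and reduce chain with scikit-learn is shown here. It assumes that "normalization" means row-wise L2 normalization (the article does not say which kind was used) and it runs on random stand-in data rather than the real flow features.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import Normalizer, StandardScaler

# Stand-in data: 200 samples, 8 features on very different scales.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 8)) * rng.uniform(1, 1000, size=8)

reducer = make_pipeline(
    StandardScaler(),         # zero mean, unit variance per feature
    Normalizer(norm="l2"),    # assumption: unit L2 norm per sample
    PCA(n_components=2),      # project onto the first two principal components
)
X_2d = reducer.fit_transform(X)
```

Wrapping the steps in a pipeline keeps them in a fixed order and lets the same fitted transformation be reused on new data through reducer.transform.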
So, from the above two visualizations, it can be clearly observed that our algorithm could, to some extent, successfully cluster out the different threats in the data.
Let’s see how our unsupervised model labels the generated clusters.
Well, it looks like our unsupervised model has found the pattern in the data and could, to some extent, segment out our target variable on its own.
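One way to check how well clusters line up with the target is to map each cluster to its majority label and measure the agreement. The sketch below uses KMeans purely as a stand-in, since the article does not name the clustering algorithm, and it runs on two synthetic, well-separated groups of 2-D points; the helper majority_label_map is hypothetical.

```python
import numpy as np
from sklearn.cluster import KMeans

# Synthetic 2-D points standing in for the PCA output (assumption).
rng = np.random.default_rng(0)
benign = rng.normal(loc=0.0, scale=1.0, size=(100, 2))
attack = rng.normal(loc=10.0, scale=1.0, size=(100, 2))
X_2d = np.vstack([benign, attack])
y_true = np.array([0] * 100 + [1] * 100)

# KMeans is a stand-in; the clusters are found without looking at y_true.
clusters = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X_2d)

def majority_label_map(clusters, y_true):
    """Map each cluster id to the most frequent true label inside it."""
    mapping = {}
    for c in np.unique(clusters):
        labels, counts = np.unique(y_true[clusters == c], return_counts=True)
        mapping[int(c)] = int(labels[np.argmax(counts)])
    return mapping

mapping = majority_label_map(clusters, y_true)
y_from_clusters = np.array([mapping[int(c)] for c in clusters])

# Fraction of samples whose cluster-derived label matches the true label.
agreement = float(np.mean(y_from_clusters == y_true))
```

The true labels are used here only after clustering, to evaluate the result; in a setting without labels this check would not be available.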
Note: Unsupervised learning gives detailed insight into the shape and structure of the data. The clusters, and the labels predicted from them, will change when the shape and structure of the data change, because the model is unaware of what the target might be. Without target labels there is no way to determine how accurate the results are, which makes supervised machine learning more applicable to real-world problems. This is also one of the reasons why models trained in an unsupervised way are not appropriate for deployment in production.
Supervised approach for threat detection
This is just the opposite of the unsupervised approach: here we let our model learn from the target variable, which helps it learn the patterns in the data through the target labels. We applied the same pre-processing as for the unsupervised approach. In this case, we used deep learning to train our model.
Structure of our DL model
As our target variable was imbalanced, we used Stratified K-fold to train and validate our model over each fold. Stratification keeps the class distribution of the target roughly the same in the training and validation sets of every fold.
We used Adam as our base optimizer and the ROC_AUC score to evaluate the performance of the model. The ROC_AUC score is the area under the receiver operating characteristic curve (ROC AUC), computed from prediction scores.
We trained and validated our model over 10 folds and achieved an average ROC_AUC score of 96% and above for threat detection, with the highest accuracy at 97% and above.
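The effect of stratification can be seen in a small sketch with scikit-learn's StratifiedKFold. The 90/10 class ratio and the dummy feature matrix are assumptions chosen for illustration, not the article's data.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

# Imbalanced toy target: 900 negatives, 100 positives (assumed ratio).
y = np.array([0] * 900 + [1] * 100)
X = np.zeros((len(y), 1))  # feature values do not affect how the split is made

skf = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)

# Record the share of positives in each validation fold.
val_pos_rates = []
for train_idx, val_idx in skf.split(X, y):
    val_pos_rates.append(float(y[val_idx].mean()))
```

Each entry of val_pos_rates stays close to the overall positive rate of 0.1, which is what keeps a rare class from vanishing in some folds.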
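To make the metric concrete, ROC AUC can be read as the probability that a randomly chosen positive sample receives a higher score than a randomly chosen negative one, with ties counting as half. The function below is a plain-Python sketch of that pairwise definition (the name roc_auc_pairwise is hypothetical); in practice sklearn.metrics.roc_auc_score computes the same quantity far more efficiently.

```python
def roc_auc_pairwise(y_true, scores):
    """ROC AUC as the fraction of (positive, negative) pairs ranked correctly."""
    pos = [s for s, t in zip(scores, y_true) if t == 1]
    neg = [s for s, t in zip(scores, y_true) if t == 0]
    if not pos or not neg:
        raise ValueError("ROC AUC needs at least one positive and one negative")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0   # positive ranked above negative
            elif p == n:
                wins += 0.5   # ties count as half
    return wins / (len(pos) * len(neg))

# Three of the four positive/negative pairs are ordered correctly.
auc = roc_auc_pairwise([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8])  # 0.75
```

Because the score depends only on how samples are ranked, it is less sensitive to class imbalance than plain accuracy.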
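The evaluation loop (10 stratified folds, an Adam-trained network, ROC AUC averaged over folds) might look roughly like the sketch below. It is not the article's model: scikit-learn's MLPClassifier with solver="adam" stands in for the deep learning model, whose architecture is not given here, and the data is synthetic.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import StratifiedKFold
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic imbalanced data standing in for the flow features (assumption).
X, y = make_classification(
    n_samples=600, n_features=10, n_informative=5,
    weights=[0.9, 0.1], class_sep=2.0, random_state=0,
)

skf = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)
fold_aucs = []
for train_idx, val_idx in skf.split(X, y):
    # Stand-in network trained with Adam; scaling is fitted on the training fold only.
    model = make_pipeline(
        StandardScaler(),
        MLPClassifier(hidden_layer_sizes=(16,), solver="adam",
                      max_iter=300, random_state=0),
    )
    model.fit(X[train_idx], y[train_idx])
    # Score each validation fold with the predicted probability of the positive class.
    val_scores = model.predict_proba(X[val_idx])[:, 1]
    fold_aucs.append(roc_auc_score(y[val_idx], val_scores))

mean_auc = float(np.mean(fold_aucs))
```

Fitting the scaler inside the loop, rather than once on the full data, avoids leaking validation statistics into training.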