A Beginner's Guide to Artificial Neural Networks using TensorFlow & Keras




Building a fraud detection model with an Artificial Neural Network and fine-tuning hyperparameters using RandomizedSearchCV

Photo by Kelly Sikkema on Unsplash

Introduction

Artificial Neural Networks (ANNs) are at the very core of deep learning, an advanced branch of machine learning. An ANN involves the following concepts: the input and output layers, the hidden layers, the neurons within the hidden layers, forward propagation, and backward propagation. In a nutshell, the input layer is the set of independent variables, the output layer represents the final output (the dependent variable), and the hidden layers consist of neurons where equations are formed and activation functions are applied. Forward propagation describes how these equations combine to produce the final output, whereas backward propagation computes the gradients used to update the weights and biases. More about the operational process can be found in the article below.

Deep Neural Network

When an ANN contains a deep stack of hidden layers, it is called a deep neural network (DNN). A DNN works with many weights and bias terms, each of which needs to be trained. In just two passes through the network (one forward, one backward), the algorithm can compute the gradient of the error with respect to every model parameter. In other words, it can identify how each weight and each bias term across all the neurons should be tweaked to reduce the error. The process repeats until the network converges to a minimum error.

Let’s run through the algorithm step by step:

  • Develop training and test data to train and validate the model. Because an ANN is a parametric model that optimizes weight and bias terms, the usual statistical checks (correlation, outlier treatment, and so on) remain valid and have to be carried out
  • The input layer consists of the independent variables and their respective values. The training set is fed to the network in mini-batches (whose size is the batch size), and the full training set is passed through multiple times. Each complete pass is called an epoch; the more epochs, the longer the training time
  • Each mini-batch is passed to the input layer, which sends it to the first hidden layer. The output of all the neurons in this layer (for every mini-batch) is computed. The result is passed on to the next layer, and the process repeats until we get the output of the last layer, the output layer. This is the forward pass: it is like making predictions, except all intermediate results are preserved since they are needed for the backward pass
  • The network’s output error is then measured using a loss function that compares the desired output to the actual output of the network
  • The contribution of every neuron to the error term is then calculated
  • Finally, the algorithm performs a Gradient Descent step to tweak the weights and biases according to the learning rate (backward propagation), and the process repeats

It is important to initialize all the hidden layers’ connection weights randomly, or else training will fail. For example, if you initialize all weights and biases to zero, then all neurons in a given layer will be perfectly identical, and thus backpropagation will affect them in exactly the same way, so they will remain identical. In other words, despite having hundreds of neurons per layer, your model will act as if it had only one neuron per layer: it won’t be too smart. If instead you randomly initialize the weights, you break the symmetry and allow backpropagation to train a diverse team of neurons (Aurélien Géron, 2017, pp. 290–291)
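As a small illustration (not part of the original code), Keras exposes this choice through the kernel_initializer argument of Dense: the default 'glorot_uniform' initializer draws random weights, while a 'zeros' initializer would reproduce the symmetry problem described above.

from tensorflow.keras.layers import Dense

# Default behaviour: random kernel weights break the symmetry between neurons
good_layer = Dense(10, activation='tanh', kernel_initializer='glorot_uniform')

# Shown only to illustrate the pitfall: with an all-zero kernel every neuron in
# the layer computes the same output, receives the same gradient, and never diverges
bad_layer = Dense(10, activation='tanh', kernel_initializer='zeros')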

Activation Functions

Activation functions are key to Gradient Descent. Gradient Descent cannot make progress on a flat surface, so the activation function must have a well-defined, non-zero derivative that allows the algorithm to move at every step. The sigmoid is popular for logistic-regression-style problems, but there are other widely used choices as well.

The Hyperbolic Tangent Function or Tanh

This function is S-shaped, continuous, and differentiable, but its output ranges from -1 to +1 rather than 0 to 1. Because each layer's output is therefore more or less centered around 0 at the beginning of training, it helps the network converge faster.

The Rectified Linear Unit

The ReLU is a continuous function that is not differentiable at Z = 0, and its derivative is 0 for Z < 0. It produces good results and, more importantly, is fast to compute. Because the function has no maximum output, some of the saturation issues that can arise during Gradient Descent are handled well.
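For reference, the activations discussed here can be written in a few lines of NumPy (an illustrative sketch, not taken from the original article):

import numpy as np

def sigmoid(z):
    # S-shaped, output in (0, 1); saturates (derivative ~ 0) for large |z|
    return 1.0 / (1.0 + np.exp(-z))

def tanh(z):
    # S-shaped, output in (-1, 1); roughly zero-centred, which speeds convergence
    return np.tanh(z)

def relu(z):
    # 0 for z < 0, identity for z >= 0; not differentiable at z = 0
    return np.maximum(0.0, z)

z = np.linspace(-3.0, 3.0, 7)
print(sigmoid(z), tanh(z), relu(z), sep="\n")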

Why do we need an activation function?

Let’s say f(x) = 2x + 5 and g(x) = 3x − 1 are the equations of two different neurons, where x is the input variable, 2 and 3 are the weights, and 5 and −1 are the bias terms. Chaining these functions gives f(g(x)) = 2(3x − 1) + 5 = 6x + 3, which is again a linear equation. Without non-linearity, a deep neural network collapses to a single linear equation and cannot handle a complex problem space.
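This collapse is easy to verify in a couple of lines (a small sketch mirroring the equations above):

import numpy as np

# Two linear "neurons" with no activation function
f = lambda x: 2 * x + 5      # weight 2, bias 5
g = lambda x: 3 * x - 1      # weight 3, bias -1

x = np.array([0.0, 1.0, 2.0])
print(f(g(x)))               # [ 3.  9. 15.]
print(6 * x + 3)             # identical: the chain is still a single linear equation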

Figure 1. Illustrates the activation functions commonly used in ANN architecture. Image developed by the author using Excel.

Loss Functions

While working on regression problems, the output layer does not need an activation function. The typical loss function for training a regression model is the mean squared error; however, when the training set contains outliers, the mean absolute error can be used instead. Huber loss is also widely used for regression tasks.

The Huber loss is quadratic when the error is smaller than a threshold t (typically 1) but linear when the error is larger than t. The linear part makes it less sensitive to outliers than the mean squared error, while the quadratic part allows faster and more precise convergence than the mean absolute error.
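In Keras the Huber loss is available as tf.keras.losses.Huber, whose delta argument plays the role of the threshold t (a brief illustration rather than code from the article):

import tensorflow as tf

huber = tf.keras.losses.Huber(delta=1.0)   # quadratic below delta, linear above it

y_true = tf.constant([0.0, 1.0, 10.0])
y_pred = tf.constant([0.1, 0.8, 2.0])      # the last error is large, outlier-like
print(huber(y_true, y_pred).numpy())       # the outlier is penalised linearly, not quadratically

# A regression model would then typically be compiled with, for example:
# model.compile(optimizer='adam', loss=tf.keras.losses.Huber(delta=1.0))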

Classification problems usually work with binary cross-entropy, categorical cross-entropy, or sparse categorical cross-entropy. Binary cross-entropy is used for binary classification, whereas categorical or sparse categorical cross-entropy is used for multiclass classification problems. You can find more details about loss functions in the link below.

Note: Categorical cross-entropy is used when the dependent variable is one-hot encoded; sparse categorical cross-entropy is used when labels are provided as integers.
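A short sketch of the distinction, using hypothetical three-class labels (not related to the fraud dataset):

import tensorflow as tf

probs = tf.constant([[0.9, 0.05, 0.05],
                     [0.1, 0.2, 0.7],
                     [0.2, 0.6, 0.2]])      # predicted class probabilities

# Integer labels pair with sparse categorical cross-entropy ...
int_labels = tf.constant([0, 2, 1])
print(tf.keras.losses.SparseCategoricalCrossentropy()(int_labels, probs).numpy())

# ... while one-hot labels pair with plain categorical cross-entropy (same value)
one_hot_labels = tf.one_hot(int_labels, depth=3)
print(tf.keras.losses.CategoricalCrossentropy()(one_hot_labels, probs).numpy())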

Developing an ANN in Python

We will use a credit card fraud dataset from Kaggle to develop a fraud detection model in a Jupyter Notebook (the same can be done in Google Colab). The dataset contains transactions made with credit cards by European cardholders in September 2013. It covers transactions that occurred over two days, with 492 frauds out of 284,807 transactions. The dataset is highly unbalanced: the positive class (frauds) accounts for 0.172% of all transactions.

import os
import random as rn

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import optimizers
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Dropout, BatchNormalization

from sklearn import preprocessing
from sklearn.model_selection import train_test_split
from sklearn.metrics import (accuracy_score, confusion_matrix, precision_score,
                             recall_score, f1_score, precision_recall_curve, auc)

print(tf.__version__)

# Pin the GPU and fix the random seeds so that runs are reproducible
os.environ["CUDA_VISIBLE_DEVICES"] = "3"
os.environ["PYTHONHASHSEED"] = "0"
tf.random.set_seed(1234)
np.random.seed(1234)
rn.seed(1254)

The dataset consists of the following attributes. Time, Principal Components, Amount, and Class. For more information refer to the Kaggle website.

file = tf.keras.utils
raw_df = pd.read_csv('https://storage.googleapis.com/download.tensorflow.org/data/creditcard.csv')
raw_df.head()

Since most of the attributes are principal components, their pairwise correlations are always 0 (a property of the orthogonal vectors produced by principal component analysis). The only column likely to contain outliers is Amount; a quick description of that column gives the statistics outlined below.
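The summary below was most likely produced along the following lines (a hedged reconstruction using the raw_df loaded above):

# Summary statistics of the transaction amount, rounded for readability
raw_df['Amount'].describe().round(2)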

count    284807.00
mean 88.35
std 250.12
min 0.00
25% 5.60
50% 22.00
75% 77.16
max 25691.16
Name: Amount, dtype: float64
Figure 2. Illustrates the correlation matrix of all attributes present in the data. Image developed by the Author using Jupyter Notebook.

Outliers can be crucial for detecting fraud, the underlying hypothesis being that higher transaction amounts could indicate fraudulent activity. However, the boxplot does not reveal any specific trend that validates this hypothesis.

Figure 3. Illustrates the boxplot representation of amount by Fraudulent and non-Fraudulent activities. Image developed by the Author using Jupyter Notebook.
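The boxplot in Figure 3 can be recreated with seaborn roughly as follows (a sketch; the exact plotting code is not shown in the original article):

import matplotlib.pyplot as plt
import seaborn as sns

# Transaction amount split by the Class label (0 = genuine, 1 = fraud)
sns.boxplot(x='Class', y='Amount', data=raw_df)
plt.yscale('log')   # optional: the heavy right tail is easier to read on a log scale
plt.title('Transaction amount by class')
plt.show()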

Preparing Input-Output & Train-Test data

# credit_data is the working dataframe derived from raw_df above
X_data = credit_data.iloc[:, :-1]
y_data = credit_data.iloc[:, -1]

X_train, X_test, y_train, y_test = train_test_split(X_data, y_data, test_size=0.2, random_state=7)
X_train = preprocessing.normalize(X_train)

The Amount column and the PCA variables use different scales, hence the data is normalized (note that sklearn's preprocessing.normalize rescales each sample, i.e. each row, to unit norm). Normalization plays an important role in gradient descent: convergence is much faster on normalized data.

print(X_train.shape)
print(X_test.shape)
print(y_train.shape)
print(y_test.shape)

The Output:

(227845, 29)  # number of records x number of columns
(56962, 29)
(227845,)
(56962,)

Developing the ANN Layer

The above output suggests that we have 29 independent variables to work with, hence the shape of the input layer is 29. The general structure of any ANN architecture is outlined below.

+----------------------------+----------------------------+
| Hyperparameter             | Binary Classification      |
+----------------------------+----------------------------+
| # input neurons            | One per input feature      |
| # hidden layers            | Typically 1 to 5           |
| # neurons per hidden layer | Typically 10 to 100        |
| # output neurons           | 1                          |
| Hidden activation          | ReLU, Tanh, sigmoid        |
| Output layer activation    | Sigmoid                    |
| Loss function              | Binary Cross-Entropy       |
+----------------------------+----------------------------+

+----------------------------+--------------------------------------------------+
| Hyperparameter             | Multiclass Classification                        |
+----------------------------+--------------------------------------------------+
| # input neurons            | One per input feature                            |
| # hidden layers            | Typically 1 to 5                                 |
| # neurons per hidden layer | Typically 10 to 100                              |
| # output neurons           | One per class                                    |
| Hidden activation          | ReLU, Tanh, sigmoid                              |
| Output layer activation    | Softmax                                          |
| Loss function              | Categorical or Sparse Categorical Cross-Entropy  |
+----------------------------+--------------------------------------------------+

Inputs to the Dense Function

  1. units — Dimension of the output
  2. activation — Activation function; if none is specified, a linear (identity) activation is used
  3. use_bias — Boolean stating whether the layer uses a bias vector
  4. kernel_initializer — Initializer for the kernel weights
  5. bias_initializer — Initializer for the bias vector
model = Sequential(layers=None, name=None)
model.add(Dense(10, input_shape=(29,), activation='tanh'))
model.add(Dense(5, activation='tanh'))
model.add(Dense(1, activation='sigmoid'))

adam = optimizers.Adam(learning_rate=0.001)
model.compile(optimizer=adam, loss='binary_crossentropy', metrics=['accuracy'])

Summary of the Architecture

model.summary()

Model: "sequential"
_________________________________________________________________
Layer (type)                 Output Shape              Param #
=================================================================
dense (Dense)                (None, 10)                300
_________________________________________________________________
dense_1 (Dense)              (None, 5)                 55
_________________________________________________________________
dense_2 (Dense)              (None, 1)                 6
=================================================================
Total params: 361
Trainable params: 361
Non-trainable params: 0
_________________________________________________________________

Let’s try and understand the output above (the explanation assumes two hidden layers):

  1. We have created a neural network with one input layer, two hidden layers, and an output layer
  2. The input layer supplies 29 variables and the first hidden layer has 10 neurons, so the weight matrix has shape 10 x 29 and the bias matrix has shape 10 x 1
  3. Total number of parameters in layer 1 = 10 x 29 + 10 x 1 = 300
  4. The first layer produces 10 output values using tanh as the activation function. The second layer has 5 neurons and works with those 10 inputs, hence its weight matrix is 5 x 10 and its bias matrix is 5 x 1
  5. Total parameters in layer 2 = 5 x 10 + 5 x 1 = 55
  6. Finally, the output layer has one neuron, but it receives 5 inputs from hidden layer 2 and has a bias term, hence the number of parameters = 5 + 1 = 6 (see the quick check after this list)
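The arithmetic above can be double-checked directly against the fitted model (a small sanity-check sketch; model is the Sequential network compiled earlier):

# Parameters per Dense layer = (inputs x neurons) + biases
layer_1 = 29 * 10 + 10   # 300: 29 features feeding 10 neurons
layer_2 = 10 * 5 + 5     #  55: 10 activations feeding 5 neurons
output_ = 5 * 1 + 1      #   6: 5 activations feeding the sigmoid neuron
print(layer_1 + layer_2 + output_)        # 361, matching model.summary()

# The same counts are available programmatically from the layers themselves
for layer in model.layers:
    print(layer.name, layer.count_params())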
model.fit(X_train, y_train.values, batch_size = 2000, epochs = 20, verbose = 1)

Epoch 1/20
114/114 [==============================] - 0s 2ms/step - loss: 0.3434 - accuracy: 0.9847
Epoch 2/20
114/114 [==============================] - 0s 2ms/step - loss: 0.1029 - accuracy: 0.9981
Epoch 3/20
114/114 [==============================] - 0s 2ms/step - loss: 0.0518 - accuracy: 0.9983
Epoch 4/20
114/114 [==============================] - 0s 2ms/step - loss: 0.0341 - accuracy: 0.9986
Epoch 5/20
114/114 [==============================] - 0s 2ms/step - loss: 0.0255 - accuracy: 0.9987
Epoch 6/20
114/114 [==============================] - 0s 1ms/step - loss: 0.0206 - accuracy: 0.9988
Epoch 7/20
114/114 [==============================] - 0s 1ms/step - loss: 0.0174 - accuracy: 0.9988
Epoch 8/20
114/114 [==============================] - 0s 1ms/step - loss: 0.0152 - accuracy: 0.9988
Epoch 9/20
114/114 [==============================] - 0s 1ms/step - loss: 0.0137 - accuracy: 0.9989
Epoch 10/20
114/114 [==============================] - 0s 1ms/step - loss: 0.0125 - accuracy: 0.9989
Epoch 11/20
114/114 [==============================] - 0s 2ms/step - loss: 0.0117 - accuracy: 0.9989
Epoch 12/20
114/114 [==============================] - 0s 2ms/step - loss: 0.0110 - accuracy: 0.9989
Epoch 13/20
114/114 [==============================] - 0s 1ms/step - loss: 0.0104 - accuracy: 0.9989
Epoch 14/20
114/114 [==============================] - 0s 1ms/step - loss: 0.0099 - accuracy: 0.9989
Epoch 15/20
114/114 [==============================] - 0s 1ms/step - loss: 0.0095 - accuracy: 0.9989
Epoch 16/20
114/114 [==============================] - 0s 1ms/step - loss: 0.0092 - accuracy: 0.9989
Epoch 17/20
114/114 [==============================] - 0s 1ms/step - loss: 0.0089 - accuracy: 0.9989
Epoch 18/20
114/114 [==============================] - 0s 1ms/step - loss: 0.0087 - accuracy: 0.9989
Epoch 19/20
114/114 [==============================] - 0s 1ms/step - loss: 0.0084 - accuracy: 0.9989
Epoch 20/20
114/114 [==============================] - 0s 1ms/step - loss: 0.0082 - accuracy: 0.9989

Evaluating Output

X_test = preprocessing.normalize(X_test)
results = model.evaluate(X_test, y_test.values)

1781/1781 [==============================] - 1s 614us/step - loss: 0.0086 - accuracy: 0.9989
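Because frauds account for only 0.172% of the data, accuracy alone says little; the metrics imported at the top of the script can be used to look at the confusion matrix, precision, and recall (a sketch assuming a 0.5 decision threshold):

from sklearn.metrics import confusion_matrix, precision_score, recall_score

# Turn predicted fraud probabilities into hard labels at a 0.5 threshold
y_prob = model.predict(X_test)
y_pred = (y_prob > 0.5).astype(int).ravel()

print(confusion_matrix(y_test.values, y_pred))
print('Precision:', precision_score(y_test.values, y_pred))
print('Recall   :', recall_score(y_test.values, y_pred))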

Analyzing Learning Curves using Tensor Board

TensorBoard is a great interactive visualization tool that can be used to view learning curves during training, compare curves across multiple runs, analyze training metrics, and more. It is installed automatically with TensorFlow.

import os
import time

root_logdir = os.path.join(os.curdir, "my_logs")

def get_run_logdir():
    # a unique, timestamped sub-directory for every run
    run_id = time.strftime("run_%Y_%m_%d-%H_%M_%S")
    return os.path.join(root_logdir, run_id)

run_logdir = get_run_logdir()
tensorboard_cb = keras.callbacks.TensorBoard(run_logdir)
model.fit(X_train, y_train.values, batch_size=2000, epochs=20, verbose=1, callbacks=[tensorboard_cb])
%load_ext tensorboard
%tensorboard --logdir=./my_logs --port=6006
Figure 4. Illustrates the TensorBoard output of the ANN run. Image developed by the Author using Jupyter Notebook.

Hyper-tuning Model Parameters

As stated earlier, there are no predefined rules on how many hidden layers or how many neurons best suit a problem. We can use RandomizedSearchCV or GridSearchCV to tune a few hyperparameters. The parameters that can be fine-tuned are outlined below:

  • Number of Hidden Layers
  • Neurons in Hidden Layers
  • Optimizer
  • Learning Rate
  • Epoch

Declaring Function to Develop the Model

def build_model(n_hidden_layer=1, n_neurons=10, input_shape=29):

    # create model
    model = Sequential()
    model.add(Dense(10, input_shape=(input_shape,), activation='tanh'))
    for layer in range(n_hidden_layer):
        model.add(Dense(n_neurons, activation="tanh"))
    model.add(Dense(1, activation='sigmoid'))

    # Compile model
    model.compile(optimizer='Adam', loss='binary_crossentropy', metrics=['accuracy'])

    return model

Using Wrapper Class to Clone the Model

from sklearn.base import clone

keras_class = tf.keras.wrappers.scikit_learn.KerasClassifier(build_fn=build_model,
                                                             epochs=100,
                                                             batch_size=10)
clone(keras_class)
keras_class.fit(X_train, y_train.values)

Creating a RandomizedSearch Grid

from scipy.stats import reciprocal
from sklearn.model_selection import RandomizedSearchCV

param_distribs = {
    "n_hidden_layer": [1, 2, 3],
    "n_neurons": [20, 30],
    # "learning_rate": reciprocal(3e-4, 3e-2),
    # "opt": ['Adam']
}

rnd_search_cv = RandomizedSearchCV(keras_class, param_distribs, n_iter=10, cv=3)
rnd_search_cv.fit(X_train, y_train.values, epochs=5)

Checking the best Parameter

rnd_search_cv.best_params_

{'n_neurons': 30, 'n_hidden_layer': 3}

rnd_search_cv.best_score_
model = rnd_search_cv.best_estimator_.model

The optimizer should also be fine-tuned, as the choice of optimizer affects gradient descent, convergence, and the automatic adjustment of learning rates. The main options are summarized below, and a short sketch of how they can be instantiated follows Figure 5.

  • Adadelta — A more robust extension of Adagrad that adapts learning rates based on a moving window of gradient updates, instead of accumulating all past gradients
  • Stochastic Gradient Descent — Frequently used; requires the learning rate to be tuned, for example with a search grid
  • Adagrad — With most other optimizers the learning rate stays the same for every parameter in each cycle. Adagrad instead adapts the learning rate ‘η’ for each parameter at every time step ‘t’, based on the past gradients of the error function
  • ADAM — Adam (Adaptive Moment Estimation) uses estimates of the first and second moments of the gradients to avoid overshooting local minima; it keeps exponentially decaying averages of past gradients and past squared gradients
Figure 5. Illustrates the convergence across different optimizers. Image from GIPHY.
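Each of these optimizers can be swapped into the compile step; a brief sketch of how they are instantiated in Keras (the learning rates shown are illustrative, not tuned values):

from tensorflow.keras import optimizers

sgd      = optimizers.SGD(learning_rate=0.01, momentum=0.9)   # learning rate usually needs tuning
adagrad  = optimizers.Adagrad(learning_rate=0.01)             # per-parameter learning rates
adadelta = optimizers.Adadelta()                              # moving window of past gradients
adam     = optimizers.Adam(learning_rate=0.001)               # first- and second-moment estimates

# Any of them can be passed to compile, e.g.
model.compile(optimizer=adam, loss='binary_crossentropy', metrics=['accuracy'])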

In general, better results are achieved by increasing the number of layers rather than the number of neurons per layer.

Reference

Aurélien Géron (2017). Hands-On Machine Learning with Scikit-Learn and TensorFlow: Concepts, Tools, and Techniques to Build Intelligent Systems. Sebastopol, CA: O’Reilly Media.