Building an Artificial Neural Network (in less than 10 minutes)

Source: Deep Learning on Medium

In the ANN model that we’ll be looking at, I used both the Rectifier and the Sigmoid activation functions. How did I use both? Here’s the intuition:
Since my output variable is binary, I use the Rectifier function in my hidden layers, and then I use the Sigmoid function in the output layer to determine the probability that the output will be 1 rather than 0.
The gap between the predicted value and the actual value is measured by a cost function (the error).
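
As a rough sketch of the two activation functions (this is not part of the original notebook, and the input values are made up for illustration), here is how they behave on a few numbers:

```python
import numpy as np

def relu(z):
    """Rectifier: passes positive values through and clips negatives to 0."""
    return np.maximum(0.0, z)

def sigmoid(z):
    """Squashes any real number into the open interval (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

z = np.array([-2.0, 0.0, 3.0])  # illustrative pre-activation values
hidden = relu(z)                 # what a hidden node would emit
prob = sigmoid(z)                # what the output node would emit
print(hidden, prob)
```

The Rectifier keeps the hidden layers cheap to compute, while the Sigmoid output can be read directly as a probability.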

The goal is to minimize the cost (loss) function, since this brings the predicted value closer to the actual value. This is done by adjusting the weights of the network. Searching exhaustively for the weights with the lowest cost can take a lot of time and computational power, so it makes sense to use a gradient descent approach to make this process much faster.

Gradient descent moves toward the lowest point of the cost function

Gradient descent uses the slope of the loss function at the current point and moves downhill to look for the lowest point of the function. However, if the function is not convex (as is common with many degrees of freedom), I could end up at a local minimum rather than the global minimum, and the network would not predict as well as it could.

Therefore, I use the stochastic gradient descent method, which updates the weights after each row instead of after a full pass over the data. Because each update is based on a single row, the path is noisier, which gives a better chance of escaping a local minimum and reaching the global one. Each update is also much cheaper to compute than a full-batch gradient step, since it only looks at one row at a time.
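
To make the difference concrete, here is a minimal sketch that fits a single weight with both approaches. The toy data, learning rate, and function names are my own assumptions for illustration; the real model has many weights and a non-convex cost.

```python
import numpy as np

# Toy data (assumed): y = 3 * x exactly, so the best weight is 3.0.
x = np.linspace(-1.0, 1.0, 50)
y = 3.0 * x

def batch_gd(x, y, lr=0.1, epochs=200):
    """One weight update per pass, using the slope averaged over ALL rows."""
    w = 0.0
    for _ in range(epochs):
        grad = np.mean(2.0 * (w * x - y) * x)
        w -= lr * grad
    return w

def sgd(x, y, lr=0.1, epochs=20, seed=0):
    """One weight update per row, visiting rows in random order."""
    rng = np.random.default_rng(seed)
    w = 0.0
    for _ in range(epochs):
        for i in rng.permutation(len(x)):
            grad = 2.0 * (w * x[i] - y[i]) * x[i]  # slope from ONE row
            w -= lr * grad
    return w

w_batch = batch_gd(x, y)
w_sgd = sgd(x, y)
print(w_batch, w_sgd)
```

Both reach the same weight on this simple convex problem; the stochastic version just gets there with many small, cheap updates.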

II. || The problem ||

This is a dataset of a firm’s customers. It is not a real dataset, but it resembles real data. The aim is to build an ANN to predict whether a customer will leave the company or not, given certain demographic characteristics such as age, gender, salary, credit score, whether they are active or not, etc.

A subset of my Dataset for this project

Thus, we can classify this problem as a demographic segmentation model.
Such a model could be used to predict anything, not just customer churn. You could also try to predict things like whether the customer should get a loan, or if the customer is more likely to purchase a new product. The only change would be to relabel the variables so that we’re predicting the right thing.
Below, you can see a workflow of the ANN model I created in this project. This includes all the steps required to build such a model. After you learn it, you can refer to this diagram to help you remember the steps.

III. || Importing Data and Preprocessing ||

As you saw in the previous section, the dataset has 10 columns (demographic parameters) and 10,000 rows (customers). I selected the X variables (inputs) based on their importance in predicting the y variable (output), which is the last column, “Churn”.
Note: in the “Churn” column, 1 is if a customer left the firm (is no longer a customer) and 0 refers to if they chose to stay with the firm.
For the X values, I chose columns 2–9, ranging from credit score to estimated salary, since these could actually be important predictors for why a customer chose to leave or stay with the firm. The column I excluded was Customer ID (since it wouldn’t have any effect on the output).

In [3]: # Importing the libraries
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
In [4]: # Importing the dataset
df = pd.read_csv('Churn_Customers.csv')
X = df.iloc[:, 1:9].values  # predictor columns (between Customer ID and Churn)
y = df.iloc[:, 9].values    # target column, Churn
In [5]: df.head(3)
| 15634 | 619 | France | Female | 42 | 0.00   | 1 | 1 | 101348 | 1 |
| 15647 | 608 | Spain  | Female | 41 | 83807  | 0 | 1 | 112542 | 0 |
| 15619 | 502 | France | Female | 42 | 159660 | 1 | 0 | 113931 | 1 |
In [7]: X
Out[7]: array([[619, 'France', 'Female', …, 1, 1, 101348.88],
       [608, 'Spain', 'Female', …, 0, 1, 112542.58],
       [502, 'France', 'Female', …, 1, 0, 113931.57],
       [709, 'France', 'Female', …, 0, 1, 42085.58],
       [772, 'Germany', 'Male', …, 1, 0, 92888.52],
       [792, 'France', 'Female', …, 1, 0, 38190.78]], dtype=object)
In [8]: y
Out[8]: array([1, 0, 1, …, 1, 1, 0])

For mobile users: The code is easier to understand on a desktop. The output of df.head(3) is the same as the subset table picture in section II.
X and y are just arrays of the predictor and result variables.

III.i ~ Encoding categorical variables

Intuitively, categorical variables are inherently different from numerical variables. Python reads a categorical variable as a string, and a model cannot do arithmetic on strings.
Thus, we need to encode these variables as numerical in order for the machine to understand and use them in the model.

In this model, there are only 2 categorical variables:
– Country (France, Spain, Germany)
– Gender (Male or Female).

Below, I use two classes from sklearn:

  • LabelEncoder: This converts each string into a numerical label such as 0, 1, 2, etc. For example, Female becomes 0 and Male becomes 1.
  • OneHotEncoder: I use this class to tell the machine that my categorical variables are not ordered. It creates a separate column for each category and assigns a value of 0 or 1 to denote whether the row belongs to that category (0 = no, 1 = yes). In some cases, if you encoded something like “Large, Medium, Small”, those would be ordered categories. However, in this case, the categories follow no order, and I have to specify that using the OneHotEncoder.
    Note: it is also important to keep the dummy variable trap in mind. The dummy columns are redundant: with two columns, the machine knows that if X0 = 0, then X1 has to be 1. To avoid this redundancy, I’ll add one line of code to remove the first dummy column.
In [9]: # Encoding the Independent Variables
from sklearn.preprocessing import LabelEncoder, OneHotEncoder
# This line is for encoding the Geography
labelencoder_X_1 = LabelEncoder()
X[:, 1] = labelencoder_X_1.fit_transform(X[:, 1])
# This line is for encoding the Gender
labelencoder_X_2 = LabelEncoder()
X[:, 2] = labelencoder_X_2.fit_transform(X[:, 2])
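# One-hot encoding the Geography column (column 1)
# Note: categorical_features only exists in older scikit-learn releases (it was removed in 0.22)
onehotencoder = OneHotEncoder(categorical_features = [1])
X = onehotencoder.fit_transform(X).toarray()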
In [10]: X = X[:, 1:] #Avoiding the dummy variable trap
In [11]: X[:, 0:10]
Out[11]: array([[ 0., 0., 619., …, 1., 1., 1.],
[ 0., 1., 608., …, 1., 0., 1.],
[ 0., 0., 502., …, 3., 1., 0.],
[ 0., 0., 709., …, 1., 0., 1.],
[ 1., 0., 772., …, 2., 1., 0.],
[ 0., 0., 792., …, 1., 1., 0.]])
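
To see what the encoding steps do without depending on a particular scikit-learn version, here is a small NumPy-only sketch on a made-up country column. It mimics label encoding, one-hot encoding, and dropping the first dummy column; it is an illustration of the idea, not the code I ran above.

```python
import numpy as np

countries = np.array(["France", "Spain", "Germany", "France"])  # toy column

# Label encoding: each category becomes an integer (alphabetical order).
categories, labels = np.unique(countries, return_inverse=True)

# One-hot encoding: one 0/1 column per category.
one_hot = np.eye(len(categories))[labels]

# Drop the first column (France) to avoid the dummy variable trap.
# A row of all zeros now means "France".
dummies = one_hot[:, 1:]

print(categories)  # order of the original dummy columns
print(labels)
print(dummies)
```

The two remaining columns stand for Germany and Spain, which matches the first two columns of the array shown above.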

III.ii ~ Splitting the Data into Train and Test sets

The lines of code below show the shape of X_train, X_test, y_train, and y_test. As you can see, the data was split with a test size of 0.2, which means that Training sets have 8000 points of data, while Test sets have 2000.

In [12]: from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = 0)
In [13]: X_train.shape #shape of X_train
Out[13]: (8000, 11)
In [14]: X_test.shape #shape of X_test
Out[14]: (2000, 11)
In [15]: y_train.shape #shape of y_train
Out[15]: (8000,)
In [16]: y_test.shape #shape of y_test
Out[16]: (2000,)

III.iii ~ Feature Scaling

Now that the data has been split into training and test sets, I will feature scale the data so that the machine doesn’t anchor on higher values and give us a biased prediction.
For example, Salary is a much larger number than Age, which can cause the machine to put more weight on Salary in the model. Therefore, I standardize every column to a mean of 0 and a standard deviation of 1, so that all features are on a comparable scale. The scaler is fitted on the training set only and then applied to the test set.

In [17]: from sklearn.preprocessing import StandardScaler
sc = StandardScaler()
X_train = sc.fit_transform(X_train)
X_test = sc.transform(X_test)
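
Under the hood, standardization is just "subtract the mean, divide by the standard deviation". The sketch below reproduces that by hand on made-up age and salary values (the numbers are assumptions for illustration only):

```python
import numpy as np

# Toy columns: age and salary for four customers (made-up numbers).
train = np.array([[25.0, 30000.0],
                  [35.0, 50000.0],
                  [45.0, 70000.0],
                  [55.0, 90000.0]])
test = np.array([[40.0, 60000.0]])

mean = train.mean(axis=0)  # learned from the training set only
std = train.std(axis=0)    # population standard deviation (ddof=0)

train_scaled = (train - mean) / std
test_scaled = (test - mean) / std  # reuse the training statistics, no refitting

print(train_scaled)
print(test_scaled)
```

Reusing the training mean and standard deviation on the test set is the reason the notebook calls fit_transform on X_train but only transform on X_test.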

IV. || Importing Keras and Libraries for the ANN ||

In this next section, I’m going to import Keras, which is the most important package I need for my ANN. I’ll use two classes that will help us define the ANN throughout the next section: Sequential and Dense. I won’t go into much detail about these classes, as the code will show what they are doing.

Note: This code uses an old version of Keras that runs on the TensorFlow 1 backend. If you installed TensorFlow 2, you can either downgrade or see the documentation to learn about any changes.

In [18]: #Importing Keras & classes
import keras
from keras.models import Sequential
from keras.layers import Dense
>>> Using TensorFlow backend.

V. || Building the ANN ||

The following steps will show how I went about building this artificial neural network:

  • Step 1: I will start by initializing the ANN using the sequential class
  • Step 2: I will add the input layer along with the first hidden layer.
  • Step 3: I will add a second hidden layer.
  • Step 4: Now I will add the output layer.
  • Step 5: After adding the layers, I will compile the ANN model
  • Step 6: Finally, I will fit the ANN model to the training set. The model will then train itself based on the number of epochs I mention.
  • Step 7: Evaluation: I will create a predictor variable and a confusion matrix to evaluate the results predicted by the machine and compare them with the actual results.

V.i ~ How to mathematically create an ANN:

  1. Randomly initialize the weights to small numbers close to 0 (but not 0).
  2. Input the first observation of your dataset in the input layer, with each feature in one input node.
  3. Forward-Propagation: from left to right, the neurons are activated in a way that the impact of each neuron’s activation is limited by the weights. Propagate the activations until you get the predicted result y.
  4. Compare the predicted result to the actual result. Measure the generated error.
  5. Back-Propagation: from right to left, the error is back-propagated. Update the weights according to how much they are responsible for the error. The learning rate decides by how much we update the weights.
  6. Repeat Steps 2 to 5 and update the weights after each observation (online, or stochastic, learning). Or: repeat Steps 2 to 5 but update the weights only after a batch of observations (batch learning).
  7. When the whole training set has passed through the ANN, that makes an epoch. Repeat for more epochs.
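
The steps above can be sketched in a few lines for the simplest possible "network": a single sigmoid neuron trained one observation at a time. The toy data, learning rate, and epoch count are assumptions for illustration, and a real ANN repeats the same idea across several layers.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data (assumed): 2 features, label is 1 when the first feature is positive.
X = rng.normal(size=(200, 2))
y = (X[:, 0] > 0).astype(float)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Step 1: small random weights close to 0.
w = rng.uniform(-0.05, 0.05, size=2)
b = 0.0
lr = 0.1  # learning rate

for epoch in range(20):          # Step 7: several passes over the data
    for xi, yi in zip(X, y):     # Step 2: feed one observation at a time
        p = sigmoid(xi @ w + b)  # Step 3: forward propagation
        error = p - yi           # Step 4: compare prediction with actual value
        # Steps 5 and 6: for one sigmoid neuron with cross-entropy loss, the
        # gradient is error * input, and we update after every observation.
        w -= lr * error * xi
        b -= lr * error

accuracy = np.mean((sigmoid(X @ w + b) > 0.5) == (y == 1))
print(accuracy)
```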

V.ii ~ Initialization

There are actually 2 ways of initializing a model: either as a sequence of layers, as I do here, or as a graph. The step below initializes the model as a sequence of layers.
I create the object classifier, which is the Artificial Neural Network that I’m about to build.

In [19]: #Initializing the Artificial Neural Network
classifier = Sequential()

V.iii ~ Adding the input layer and the first hidden layer

In the steps below, I used the add method of the object to include the Dense class in the classifier object. Dense is essentially what is allowing us to create the layers for the model.
Now, upon inspecting the Dense class, I can see there are a number of parameters, but as the mathematical steps above show us, I know already which parameters to input for the model.
So I will use the following for the input layer and the first hidden layer:

output_dim (output dimensions):
This is simply the number of nodes I want to add in the hidden layer. There is no single right answer here, and experimentation is the way to choose the right number of nodes. In this project, I took the average of the number of nodes in the input and output layers, (8 + 1)/2 = 4.5, and rounded it up to 5.
init (random initialization):
Initializing the weights is the first step of stochastic gradient descent. I need to initialize the weights to small numbers close to 0. The default value for this parameter is given as "glorot_uniform", but for simplification, I will use the "uniform" function, which will initialize the weights according to a uniform distribution.
activation (activation function):
As the name suggests, this is the activation function. In the first hidden layer, we want to use the rectifier activation function, as I mentioned in the introduction, and that’s why I input relu in this parameter.
input_dim (input dimensions):
This is the number of nodes in the input layer, which I already know is 8. Keras expects this number to equal the number of columns in X_train (X_train.shape[1]).

In [20]: #Adding the input layer and a hidden layer
classifier.add(Dense(output_dim = 5, init = 'uniform', activation = 'relu',
                     input_dim = 8))

V.iv ~ Adding the second hidden layer

For this hidden layer, I use the add method on the classifier object again.
Using the Dense class, I have a similar line of code. The only difference is that this time there’s no need to specify the number of input nodes, since the model infers it from the previous layer.

In [27]: #Adding second hidden layer
classifier.add(Dense(output_dim = 5, init = 'uniform', activation = 'relu'))
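
To get a feel for how small this network is, here is a quick back-of-the-envelope count of its trainable weights. It assumes the layer sizes stated in the text (8 inputs, two hidden layers of 5 nodes, and the single output node from Step 4) and one bias per node; dense_params is a helper name I made up for this sketch.

```python
def dense_params(n_inputs, n_units):
    """Weights plus one bias per unit for a fully connected layer."""
    return (n_inputs + 1) * n_units

layer_sizes = [8, 5, 5, 1]  # input, hidden 1, hidden 2, output
per_layer = [dense_params(i, o) for i, o in zip(layer_sizes, layer_sizes[1:])]
total = sum(per_layer)
print(per_layer, total)  # [45, 30, 6] 81
```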

V.v ~ Adding the output layer

The final layer that we need to code into the model is the output layer. This process will again use the same add method with the Dense class.

However, this time the number of nodes is changed to 1: since there is only one binary output variable (1 or 0), this layer only needs 1 node.

The activation function also changes to 'sigmoid', since we want the probability that the output is 1 rather than 0.

In [21]: #Adding output layer
classifier.add(Dense(output_dim = 1, init = 'uniform', activation = 'sigmoid'))
/Users/rohangupta/anaconda3/lib/python3.7/site-packages/ UserWarning: Update your `Dense` call to the Keras 2 API: `Dense(activation="sigmoid", units=1, kernel_initializer="uniform")`

V.vi ~ Compiling the ANN model

This time I use the compile method on the classifier object and I input the following parameters:

  • Optimizer: This is the algorithm I want to use to find the optimal set of weights for the ANN model. The model’s layers have been built, but the weights have only been initialized. Therefore, it is important to use an optimizer to find the right combination of weights. Adam is one of the stochastic gradient descent algorithms, and that is the one I will use to find the optimal set of weights for this model.
  • Loss: This corresponds to the loss function that the stochastic gradient descent algorithm minimizes.
    The basic idea is that the algorithm has to optimize this loss function to find the optimal weights. For example, in linear regression, I would use the sum-of-squares loss function. Here, since the output layer is binary, I use a logarithmic loss function known as binary_crossentropy (Binary Cross-Entropy).
  • Metrics: This is the criterion I use to evaluate the model. I use the accuracy metric (correct predictions over total predictions), so I input 'accuracy' in the metrics parameter. Since this parameter expects a list, I put it in square brackets.
In [22]: #Compiling the artificial neural network
classifier.compile(optimizer = 'adam', loss = 'binary_crossentropy',
                   metrics = ['accuracy'])
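
For intuition about what the binary cross-entropy loss measures, here is a hand-written NumPy version applied to made-up predictions. This is a sketch of the formula, mean of -[y·log(p) + (1 − y)·log(1 − p)], not the Keras implementation itself; the eps clipping is my own guard against log(0).

```python
import numpy as np

def binary_crossentropy(y_true, y_prob, eps=1e-7):
    """Mean of -[y*log(p) + (1-y)*log(1-p)], clipped to avoid log(0)."""
    p = np.clip(y_prob, eps, 1.0 - eps)
    losses = -(y_true * np.log(p) + (1.0 - y_true) * np.log(1.0 - p))
    return float(np.mean(losses))

y_true = np.array([1.0, 0.0, 1.0])
confident_right = binary_crossentropy(y_true, np.array([0.9, 0.1, 0.9]))
confident_wrong = binary_crossentropy(y_true, np.array([0.1, 0.9, 0.1]))
print(confident_right, confident_wrong)
```

Confident correct predictions give a loss close to 0, while confident wrong predictions are penalized heavily, which is what pushes the weights in the right direction.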

V.vii~ Fitting the ANN model to the training set

Now I will fit the model to the training dataset and train it for a certain number of epochs.
I start by using the fit method to fit the classifier model to X_train and y_train. Then, I add two more parameters, which are the batch size and the number of epochs. If you look back at the beginning of this section, steps 6 and 7 refer to these parameters.

In step 6, we can choose to update the weights after every observation or after every batch. For this step, I’ll use batches of 10 to update the weights.

Step 7 tells us that we need to pass the whole training set through the network more than once. An epoch refers to one round of the entire training set going through the ANN. I chose 100 epochs; choosing these values is an experimental process.

In [23]: #Fitting artificial neural network to the training set
classifier.fit(X_train, y_train, batch_size = 10, nb_epoch = 100)
Epoch 1/100
8000/8000 [==============================] - 2s 279us/step - loss: 0.4822 - acc: 0.7956
Epoch 98/100
8000/8000 [==============================] - 2s 277us/step - loss: 0.3866 - acc: 0.8412
Epoch 99/100
8000/8000 [==============================] - 2s 245us/step - loss: 0.3829 - acc: 0.8405
Epoch 100/100
8000/8000 [==============================] - 2s 259us/step - loss: 0.3767 - acc: 0.8420
Out[23]: <keras.callbacks.History at 0x1a28680cc0>
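
As a quick sanity check on what these settings mean, the arithmetic below counts how many weight updates the training run performs, using the training set size, batch size, and epoch count from this section:

```python
import math

n_train = 8000    # rows in X_train
batch_size = 10
epochs = 100

updates_per_epoch = math.ceil(n_train / batch_size)  # batches per epoch
total_updates = updates_per_epoch * epochs
print(updates_per_epoch, total_updates)  # 800 80000
```

A smaller batch size means more (and noisier) updates per epoch, while a larger one means fewer, smoother updates.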

VI. || Predicting Results & Evaluating the Model ||

Now that the model has been trained, I will create a variable, y_pred, to store the machine’s predictions. For this, I used the predict method on the X_test dataset to get values corresponding to y_test.

In [24]: # Predicting the Test set results
y_pred = classifier.predict(X_test)
Out[24]: array([[0.29425734],
[0.2981767 ],
[0.6088282 ],
[0.750462 ]], dtype=float32)

Now, the y_pred variable above shows values between 0 and 1. This is because I used the sigmoid function, so the predict method gives us the probability that a customer left. What I actually want are binary values such as 0 or 1, True or False, yes or no. These would tell us whether a customer left the firm or chose to stay according to my prediction.

In [25]: #Converting probabilities into a binary result
y_pred = (y_pred > 0.5)
Out[25]: array([[False],
[ True],
[ True]])

The output above makes a lot more sense. I basically used the code y_pred > 0.5 so that the model would tell us if it is true or false depending on whether the probability of the customer leaving was above or below 50%.
Note: It is True if the customer left the firm and False if the customer chose to stay.
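
Here is the same thresholding step on the four probabilities printed earlier, as a standalone sketch. The conversion to 0/1 integers is an optional extra that I added for illustration.

```python
import numpy as np

probs = np.array([[0.29425734], [0.2981767], [0.6088282], [0.750462]])

left = probs > 0.5                      # True means "predicted to leave"
left_as_int = left.astype(int).ravel()  # optional: 0/1 instead of False/True
print(left_as_int)  # [0 0 1 1]
```

The 0.5 cut-off is a choice, not a rule: lowering it would flag more customers as likely to leave, at the cost of more false alarms.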

VI.i ~ Making the Confusion Matrix

A Confusion Matrix

Finally, I’m going to create a confusion matrix to evaluate the model. This will tell us how many incorrect and correct predictions the model had. The matrix will be a 2X2 box.

The image on the left shows us what the confusion matrix would look like. We get an instant view of the model’s true and false positives and negatives.

In [26]: # Making the Confusion Matrix
from sklearn.metrics import confusion_matrix
cm = confusion_matrix(y_test, y_pred)
In [27]: cm
Out[27]: array([[1538, 57],
       [ 241, 164]])
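
A note on reading this matrix: in scikit-learn’s confusion_matrix, rows are the actual classes and columns are the predicted classes, both in sorted label order. For 0/1 labels that means the layout is [[TN, FP], [FN, TP]]. The sketch below unpacks the matrix shown above with plain NumPy; the lowercase variable names are my own.

```python
import numpy as np

cm = np.array([[1538, 57],
               [241, 164]])

# Row = actual class, column = predicted class.
tn, fp, fn, tp = cm.ravel()
print(tn, fp, fn, tp)  # 1538 57 241 164
```

So 57 customers who stayed were predicted to leave, and 241 customers who left were predicted to stay.
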
In [28]: TP = 164 #True Positives (actual 1, predicted 1)
TN = 1538 #True Negatives (actual 0, predicted 0)
FP = 57 #False Positives (actual 0, predicted 1)
FN = 241 #False Negatives (actual 1, predicted 0)

VI.ii ~ The accuracy, precision, recall and F1 score of the model.

The descriptions for these metrics are given as the following:

  • Accuracy is as the name goes. It measures the percentage of correct predictions made by the model.
    Sum all True predictions and divide by the total number of predictions.
  • Precision is the ratio of correctly predicted positive observations to all observations predicted as positive.
    Divide True Positives by the sum of True Positives and False Positives.
  • Recall is the ratio of correctly predicted positive observations to all observations in the actual positive class.
    Divide True Positives by the sum of True Positives and False Negatives.
  • F1 Score is the harmonic mean of Precision and Recall. F1 Score might be a better measure to use if I need to seek a balance between Precision and Recall, and if there is an uneven class distribution.
    Multiply Precision and Recall, divide the result by the sum of Precision and Recall, and then multiply the final result by 2.
In [29]: #Accuracy
Accuracy = (TP + TN)/(TP + TN + FN + FP)
Out[29]: 0.851
In [30]: #Precision
Precision = TP / (TP + FP)
Out[30]: 0.7420814479638009
In [31]: #Recall
Recall = TP / (TP + FN)
Out[31]: 0.4049382716049383
In [32]: #F1 Score
F1_Score = 2 * Precision * Recall / (Precision + Recall)
Out[32]: 0.523961661341853

VI.iii ~ Graphing the Evaluation Metrics (Conclusion)

Finally, I will graph each of the metrics calculated above, so I can visualize how effective the model really is. I’ll use matplotlib for this, so make sure you have imported matplotlib.pyplot.
In section III, I did it this way: import matplotlib.pyplot as plt

In [33]: Eval_Metrics = [Accuracy, Precision, Recall, F1_Score]
Metric_Names = ['Accuracy', 'Precision', 'Recall', 'F1 Score']
Metrics_pos = np.arange(len(Metric_Names))  # x positions of the bars
plt.bar(Metrics_pos, Eval_Metrics)
plt.xticks(Metrics_pos, Metric_Names)
plt.title('Accuracy v Precision v Recall v F1 Score of the ANN model')
plt.show()