Source: Deep Learning on Medium
Logistic Regression Classifier with a Neural Network mindset to recognize cats.
The idea behind this article is threefold. First, to solve an interesting problem from start to end. Second, to learn the theory, maths and intuition behind it while solving the problem. And third, I believe it is one of the most elegant ways to get started with the entire genre of Deep Learning. For me, it has to be the first step.
So, let’s dive into the problem straightaway.
You are given a dataset with the following information –
- A training set of m_train images labeled as cat (y = 1) or non-cat (y = 0)
- A test set of m_test images labeled as cat or non-cat
- Each image is of shape (num_px, num_px, 3), where num_px is both the height and the width of the image (so the image is square) and 3 is the number of colour channels: Red, Green and Blue (RGB for short).
You will build a simple image-recognition algorithm that can correctly classify pictures as cat or non-cat.
From here onwards we will outline the solution approach, walk through the code and discuss the concepts which are relevant to solving this problem.
Now, before we go any further, let me explain why I chose the problem of recognizing cat versus non-cat. I could just as well have chosen to classify different breeds of cat, or cat/dog/rest. The reason is a strong one.
Neural Networks are a general class of Machine Learning algorithms. In theory, a network with enough hidden units can approximate a very broad class of functions; this result is known as the Universal Approximation Theorem. Logistic regression can be seen as the simplest special case of a Neural Network: a single unit with a sigmoid activation and no hidden layer, which deals with binary classes like 0/1 or cat/non-cat.
The idea is that if we understand the math, the architecture and the implementation behind it, we will be able to confidently extend it to multi-class problems, and from there we will be playing in the field of Neural Networks once and for all.
Also, in this solution we will build the code from the ground up and not use any machine learning library. Once we understand it, it will be a lot easier for us to use the libraries to solve more complex problems.
Also, I will use Kaggle to share the dataset and a Kaggle kernel (notebook) to code it, and I encourage you all to create your own notebooks and play around with this dataset.
The final code will be provided in the last section of this article. Throughout the article section-wise code snippets are provided.
With this expectation, let’s get started by solving this problem.
Step — 1: Importing required packages
We will be coding in Python, and hence we will import Python modules which we will need to solve this problem –
import numpy as np
import matplotlib.pyplot as plt
import h5py
from PIL import Image
from scipy import ndimage

%matplotlib inline
Instead of going over each of these modules, I encourage you to look them up on the web if you are not familiar with them. Let me pick a few which are somewhat specific to this problem.
- h5py : Both our train and test set data (input data) are in .h5 format (HDF5 binary data format). The h5py package is a Pythonic interface to the HDF5. It lets you store huge amounts of numerical data, and easily manipulate that data from NumPy. For example, you can slice into multi-terabyte datasets stored on disk, as if they were real NumPy arrays. Thousands of datasets can be stored in a single file, categorized and tagged however you want.
- PIL : The Python Imaging Library (PIL) adds image processing capabilities to your Python interpreter. This library supports many file formats, and provides powerful image processing and graphics capabilities.
Step — 2: Import the Dataset and Reshape it
Number of training examples: m_train = 209
Number of testing examples: m_test = 50
Height/Width of each image: num_px = 64
Each image is of size: (64, 64, 3)
train_set_x shape: (209, 64, 64, 3)
train_set_y shape: (1, 209)
test_set_x shape: (50, 64, 64, 3)
test_set_y shape: (1, 50)
Each line of your train_set_x_orig and test_set_x_orig is an array representing an image. There are in total 209 training images and 50 test images.
Each image is (64, 64, 3), i.e. height and width are both 64 pixels. So it is a square image, and it has 3 channels (RGB) as it is a colour image.
Let’s look at a few images –
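The numbers above can be read directly off the array shapes. Here is a minimal sketch, assuming placeholder arrays with the same shapes as the real data (the zeros below only stand in for the images; they are not the actual dataset):

```python
import numpy as np

# Placeholder arrays with the shapes reported above (NOT the real images).
train_set_x_orig = np.zeros((209, 64, 64, 3), dtype=np.uint8)
test_set_x_orig = np.zeros((50, 64, 64, 3), dtype=np.uint8)

m_train = train_set_x_orig.shape[0]   # number of training examples
m_test = test_set_x_orig.shape[0]     # number of test examples
num_px = train_set_x_orig.shape[1]    # height (= width) of each image

print("m_train =", m_train, "| m_test =", m_test, "| num_px =", num_px)
```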
# Example of a cat
index = 24
print ("y = " + str(train_set_y[:, index]) + ", it's a '" + classes[np.squeeze(train_set_y[:, index])].decode("utf-8") + "' picture.")
# Example of a not-cat
index = 100
print ("y = " + str(train_set_y[:, index]) + ", it's a '" + classes[np.squeeze(train_set_y[:, index])].decode("utf-8") + "' picture.")
For convenience, you should now reshape images of shape (num_px, num_px, 3) into a numpy array of shape (num_px * num_px * 3, 1). After this, our training (and test) dataset is a numpy array where each column represents a flattened image. There should be m_train (respectively m_test) columns.
We will follow this convention for shaping our data throughout our Deep Learning applications.
A trick when you want to flatten a matrix X of shape (a, b, c, d) to a matrix X_flatten of shape (b*c*d, a) is to use –
X_flatten = X.reshape(X.shape[0], -1).T  # X.T is the transpose of X
We will apply this trick to our data below:
# Reshape the training and test examples
train_set_x_flatten = train_set_x_orig.reshape(train_set_x_orig.shape[0], -1).T
test_set_x_flatten = test_set_x_orig.reshape(test_set_x_orig.shape[0], -1).T

print("train_set_x_flatten shape: " + str(train_set_x_flatten.shape))
print("train_set_y shape: " + str(train_set_y.shape))
print("test_set_x_flatten shape: " + str(test_set_x_flatten.shape))
print("test_set_y shape: " + str(test_set_y.shape))

# Output ---
# train_set_x_flatten shape: (12288, 209)
# train_set_y shape: (1, 209)
# test_set_x_flatten shape: (12288, 50)
# test_set_y shape: (1, 50)
To represent color images, the red, green and blue channels (RGB) must be specified for each pixel, so each pixel value is actually a vector of three numbers, each ranging from 0 to 255.
One common preprocessing step in machine learning is to center and standardize your dataset, meaning that you subtract the mean of the whole numpy array from each example, and then divide each example by the standard deviation of the whole numpy array. But for picture datasets, it is simpler and more convenient and works almost as well to just divide every row of the dataset by 255 (the maximum value of a pixel channel).
train_set_x = train_set_x_flatten/255.
test_set_x = test_set_x_flatten/255.
So, you can see now that “train_set_x” is a matrix of pixel values and “train_set_y” is a row vector of 0/1 values, i.e. non-cat/cat.
If you also check the class distribution, the training data is about 65% non-cat and 35% cat.
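Checking the class balance takes one line once the labels are in a row vector. A small sketch, assuming a made-up label vector `toy_y` (hypothetical values, not the real train_set_y):

```python
import numpy as np

# Toy label row vector of shape (1, m): 1 = cat, 0 = non-cat (made-up values).
toy_y = np.array([[1, 0, 0, 1, 0, 0, 0, 1, 0, 0]])

# The mean of a boolean mask is the fraction of entries that are True.
cat_fraction = float(np.mean(toy_y == 1))
print(f"cat: {cat_fraction:.0%}, non-cat: {1 - cat_fraction:.0%}")
```

The same expression applied to the real label vector gives the split quoted above.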
Step — 3 : Understanding the learning algorithm
Let’s design a simple algorithm to distinguish cat images from non-cat images.
Let’s start with the notation –
Here, nx is the number of flattened pixel values per image which we saw earlier, i.e. 12288, and m is the number of examples (here, training examples), i.e. 209. The superscript (1) refers to the first image, and so on.
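In symbols, and assuming the column-per-example convention we used when flattening the data, the notation can be sketched as:

```latex
(x, y), \qquad x \in \mathbb{R}^{n_x}, \qquad y \in \{0, 1\}

X = \begin{bmatrix} x^{(1)} & x^{(2)} & \cdots & x^{(m)} \end{bmatrix} \in \mathbb{R}^{n_x \times m},
\qquad
Y = \begin{bmatrix} y^{(1)} & y^{(2)} & \cdots & y^{(m)} \end{bmatrix} \in \mathbb{R}^{1 \times m}
```

For our data this means X has shape (12288, 209) and Y has shape (1, 209), matching the shapes printed earlier.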
Here, we will use Logistic Regression, as the output label y is either 0 or 1, i.e. this is a binary classification problem. So, let’s see what that algorithm looks like:
If z is a large positive number, sigmoid(z) will be close to 1; conversely, if z is a large negative number, sigmoid(z) will be close to 0.
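Written out, and assuming the standard form of logistic regression with weights w, bias b and the activation a standing for yhat, the model is:

```latex
z = w^{T} x + b,
\qquad
\hat{y} = a = \sigma(z) = \frac{1}{1 + e^{-z}}
```

Since the sigmoid squashes any real z into the interval (0, 1), yhat can be read as a probability.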
So, when you implement logistic regression, your job is to learn the parameters “W” and “b” such that “yhat” becomes a good estimate of the chance of “y” being equal to 1.
Just keep in mind that “W” is an nx-dimensional vector, so that it matches the dimension of “x” which we have seen above, and “b” is a real number.
Now, to train the parameters “W” and “b”, we need a cost function (also called an objective function or loss function). We need to define it in such a way that it measures how good “yhat” is when the true label is “y”. We will choose a function which is convex, meaning it has a single global minimum to which the algorithm can converge.
The explanation below will help you understand why we chose a loss function of this kind; it is also known as the Negative Log-Likelihood.
If y = 1: the loss function is L = -log(yhat). To minimize the loss we want L to be as small as possible (i.e. as close to 0 as possible, since L is never negative), which means we want log(yhat) to be as large as possible, which in turn means yhat should be as large as possible. As yhat is the output of a sigmoid, the largest value it can approach is 1. So training the parameters with this loss function pushes yhat towards y = 1.
Similarly, if y = 0: the loss function is L = -log(1 - yhat). By the same reasoning we want (1 - yhat) to be as large as possible, which means yhat should be as small as possible. As yhat can’t be smaller than 0 (due to the sigmoid), the smallest value it can approach is 0. So training the parameters with this loss function pushes yhat towards y = 0.
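Both cases can be combined into a single expression. Assuming the standard cross-entropy form, the loss for one example and the cost J over all m training examples are:

```latex
\mathcal{L}(\hat{y}, y) = -\bigl( y \log \hat{y} + (1 - y) \log (1 - \hat{y}) \bigr)

J(w, b) = \frac{1}{m} \sum_{i=1}^{m} \mathcal{L}\bigl(\hat{y}^{(i)}, y^{(i)}\bigr)
```

Setting y = 1 leaves only the first term, -log(yhat), and setting y = 0 leaves only the second, -log(1 - yhat), which are exactly the two cases discussed above.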
The final step of this learning algorithm is to optimize the cost function, i.e. we want to find the optimal “W” and “b” that minimize J(W, b). This looks like this in 3D space.
The cost function J is a convex function. So it is just a single big bowl, as opposed to non-convex functions, which have many bumps and many local minima. This is one of the reasons why we use this particular cost function J for logistic regression.
So, now we have to initialize W and b with some values (random values or zeros both work for logistic regression, because the cost is convex) and optimize them in such a way that they reach the bottom of the bowl, the red point where J(W, b) is minimized. The technique which we will use to do this optimization is called Gradient Descent.
Below, in a nutshell, is the process of Gradient Descent for one parameter.
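To make the one-parameter case concrete, here is a minimal sketch. It assumes a toy convex cost J(w) = (w - 3)^2, chosen only for illustration, whose derivative is 2(w - 3); the function name `gradient_descent_1d` is hypothetical:

```python
def gradient_descent_1d(grad, w0, learning_rate=0.1, num_iterations=100):
    """Repeatedly apply the update w := w - learning_rate * dJ/dw."""
    w = w0
    for _ in range(num_iterations):
        w = w - learning_rate * grad(w)
    return w

# Toy cost J(w) = (w - 3)^2 has its minimum at w = 3; its derivative is 2 * (w - 3).
w_min = gradient_descent_1d(lambda w: 2.0 * (w - 3.0), w0=0.0)
print(w_min)
```

Each step moves w against the slope, so starting from w = 0 the value slides down the bowl towards the minimum at w = 3.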
Now let’s implement Gradient Descent for Logistic Regression. As a recap, this is what the earlier equations look like –
Now, to perform gradient descent on this, let’s convert it into a computation graph which can be visualized properly:
We need to compute the derivatives of this loss with respect to the parameters, similar to what we saw in Gradient Descent, so that we can minimize the loss and get the optimal values of our parameters (W, b).
Now, after calculating the derivatives, update “W” and “b” so that we move towards the global minimum of the cost function.
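For a single training example, working backwards through the computation graph with the chain rule gives the following derivatives. This is a sketch assuming the sigmoid activation a = yhat and the cross-entropy loss L discussed earlier, with α as the learning rate:

```latex
\frac{\partial \mathcal{L}}{\partial a} = -\frac{y}{a} + \frac{1 - y}{1 - a},
\qquad
\frac{\partial a}{\partial z} = a (1 - a)
\quad\Longrightarrow\quad
dz = \frac{\partial \mathcal{L}}{\partial z} = a - y

dw = \frac{\partial \mathcal{L}}{\partial w} = x \, dz,
\qquad
db = \frac{\partial \mathcal{L}}{\partial b} = dz

w := w - \alpha \, dw,
\qquad
b := b - \alpha \, db
```

The error term dz = a - y is simply the gap between the prediction and the label, which is why the updates shrink as the predictions improve.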
Now, let’s wrap up this section with the formal definition of this algorithm and what we need to do to implement it to recognize cats. The final algorithm is vectorized, which means we avoid for loops over the training examples (from 1 to m) and over the features (from x1 to xp). Vectorizing is achieved by using matrix operations instead of processing individual observations. This is what one iteration of Gradient Descent looks like (note that you still need to perform multiple iterations of Gradient Descent so that the cost function is minimized) –
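One vectorized iteration can be written as follows. This is a sketch assuming X of shape (n_x, m), Y of shape (1, m), a column vector w of shape (n_x, 1) and learning rate α:

```latex
Z = w^{T} X + b,
\qquad
A = \sigma(Z)

J = -\frac{1}{m} \sum_{i=1}^{m} \Bigl( y^{(i)} \log a^{(i)} + (1 - y^{(i)}) \log (1 - a^{(i)}) \Bigr)

dZ = A - Y,
\qquad
dw = \frac{1}{m} X \, dZ^{T},
\qquad
db = \frac{1}{m} \sum_{i=1}^{m} dz^{(i)}

w := w - \alpha \, dw,
\qquad
b := b - \alpha \, db
```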
Now that we have gone over understanding the algorithm, let’s build it.
Step — 4: General Architecture of this learning algorithm
In the last section we have seen how this algorithm works.
If it was a little awkward to understand, believe me, it was the same for me as well. But I kept trying, first understanding the intuition, then the maths, and kept doing it over and over again. At this point I recommend that you move on to complete this implementation of recognizing cats and then keep coming back to understand it better. I would also encourage you to read other resources, which will make your understanding even clearer.
With that, let’s visualize this algorithm through a generic architecture. This will help you understand why Logistic Regression is actually a very simple Neural Network –
So, the main steps to build this architecture are –
- Define the model structure (such as number of input features)
- Initialize the model’s parameters
- Within a loop “Calculate current loss (forward propagation)”
- Within the same loop “Calculate current gradient (backward propagation)”
- Within the same loop “Update parameters (gradient descent)”
When the loop ends, we expect the cost function to be close to its minimum, which gives us the optimal values of our model parameters, in our case W and b.
Step — 5: Building all the necessary functions for the model
a) Building the helper function –
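The helper we need is the sigmoid function. A minimal sketch using NumPy (the name `sigmoid` is an assumption about what the helper is called):

```python
import numpy as np

def sigmoid(z):
    """Return 1 / (1 + exp(-z)) element-wise for a scalar or numpy array z."""
    return 1.0 / (1.0 + np.exp(-z))

# Large negative inputs map close to 0, zero maps to 0.5, large positive inputs close to 1.
print(sigmoid(np.array([-10.0, 0.0, 10.0])))
```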
b) Initializing the parameters –
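For logistic regression it is enough to start from zeros. A sketch, assuming a hypothetical helper named `initialize_with_zeros` that takes the number of input features:

```python
import numpy as np

def initialize_with_zeros(dim):
    """Create a zero weight column vector of shape (dim, 1) and a zero bias."""
    w = np.zeros((dim, 1))
    b = 0.0
    return w, b

# One weight per flattened pixel value: 64 * 64 * 3 = 12288.
w, b = initialize_with_zeros(64 * 64 * 3)
print(w.shape, b)
```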
c) Forward and Backward propagation –
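Forward propagation computes the activations and the cost; backward propagation computes the gradients. A sketch assuming X of shape (n_x, m) and Y of shape (1, m); the function name `propagate` and the dictionary keys are illustrative choices, not taken from the original code:

```python
import numpy as np

def propagate(w, b, X, Y):
    """Return the gradients and the cost for the current parameters w and b."""
    m = X.shape[1]
    # Forward propagation: activations A of shape (1, m) and the average cost.
    A = 1.0 / (1.0 + np.exp(-(np.dot(w.T, X) + b)))
    cost = -np.sum(Y * np.log(A) + (1 - Y) * np.log(1 - A)) / m
    # Backward propagation: gradients of the cost with respect to w and b.
    dw = np.dot(X, (A - Y).T) / m
    db = np.sum(A - Y) / m
    return {"dw": dw, "db": db}, float(cost)

# Tiny made-up example: two features, two examples, parameters at zero.
grads, cost = propagate(np.zeros((2, 1)), 0.0,
                        np.array([[1.0, 2.0], [3.0, 4.0]]),
                        np.array([[1, 0]]))
print(grads["dw"].ravel(), grads["db"], cost)
```

With all parameters at zero every activation is 0.5, so the cost in this toy call is log(2).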
d) Optimization : Updating the parameters using Gradient Descent
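The optimization step repeats "compute gradients, then update" for a fixed number of iterations. A self-contained sketch, assuming a hypothetical `optimize` function that records the cost every 100 iterations; the toy data at the bottom is made up:

```python
import numpy as np

def optimize(w, b, X, Y, num_iterations=1000, learning_rate=0.1):
    """Run gradient descent and return the learned w, b and the recorded costs."""
    m = X.shape[1]
    costs = []
    for i in range(num_iterations):
        # Forward pass: activations and cost for the current parameters.
        A = 1.0 / (1.0 + np.exp(-(np.dot(w.T, X) + b)))
        cost = -np.sum(Y * np.log(A) + (1 - Y) * np.log(1 - A)) / m
        # Backward pass: gradients of the cost.
        dw = np.dot(X, (A - Y).T) / m
        db = np.sum(A - Y) / m
        # Gradient descent update.
        w = w - learning_rate * dw
        b = b - learning_rate * db
        if i % 100 == 0:
            costs.append(float(cost))
    return w, b, costs

# One made-up feature: negative values are class 0, positive values are class 1.
X_toy = np.array([[-2.0, -1.0, 1.0, 2.0]])
Y_toy = np.array([[0, 0, 1, 1]])
w_opt, b_opt, costs = optimize(np.zeros((1, 1)), 0.0, X_toy, Y_toy)
print(w_opt, b_opt, costs[0], costs[-1])
```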
e) Predict : Convert the entries of a into 0 (if activation <= 0.5) or 1 (if activation > 0.5)
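The thresholding rule above can be sketched like this, assuming a hypothetical `predict` function that takes already learned parameters w and b (the values used in the demo are made up):

```python
import numpy as np

def predict(w, b, X):
    """Return a (1, m) array with 1 where the activation is > 0.5 and 0 otherwise."""
    A = 1.0 / (1.0 + np.exp(-(np.dot(w.T, X) + b)))
    return (A > 0.5).astype(int)

# Made-up parameters and inputs: the activations are roughly 0.05, 0.5 and 0.88.
predictions = predict(np.array([[1.0]]), 0.0, np.array([[-3.0, 0.0, 2.0]]))
print(predictions)
```

Note that an activation of exactly 0.5 is mapped to 0, in line with the rule "0 if activation <= 0.5".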
Step — 6: Training the Model : merge all functions into a model
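Putting the pieces together, the model initializes the parameters, runs gradient descent, predicts on both sets and reports the accuracies. Below is a compact sketch under stated assumptions: the function name `model`, its arguments and the returned keys are hypothetical, and the demo runs on synthetic two-feature data, not on the cat dataset:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def model(X_train, Y_train, X_test, Y_test, num_iterations=500, learning_rate=0.1):
    """Hypothetical end-to-end helper: initialize, train, predict, report accuracy."""
    n_x, m = X_train.shape
    w, b = np.zeros((n_x, 1)), 0.0                  # initialize parameters
    for _ in range(num_iterations):
        A = sigmoid(np.dot(w.T, X_train) + b)       # forward propagation
        dZ = A - Y_train                            # backward propagation
        w = w - learning_rate * np.dot(X_train, dZ.T) / m
        b = b - learning_rate * np.sum(dZ) / m

    def accuracy(X, Y):
        predictions = (sigmoid(np.dot(w.T, X) + b) > 0.5).astype(int)
        return 100.0 - float(np.mean(np.abs(predictions - Y))) * 100.0

    return {"w": w, "b": b,
            "train_accuracy": accuracy(X_train, Y_train),
            "test_accuracy": accuracy(X_test, Y_test)}

# Synthetic data (NOT the cat images): class 1 is centred at (2, 2), class 0 at (-2, -2).
rng = np.random.default_rng(0)

def make_toy_data(m):
    Y = rng.integers(0, 2, size=(1, m))
    X = rng.normal(size=(2, m)) + 2.0 * (2 * Y - 1)
    return X, Y

X_train, Y_train = make_toy_data(200)
X_test, Y_test = make_toy_data(50)
result = model(X_train, Y_train, X_test, Y_test)
print("train accuracy: {:.1f} %".format(result["train_accuracy"]))
print("test accuracy: {:.1f} %".format(result["test_accuracy"]))
```

For the cat problem the same function would be called with train_set_x, train_set_y, test_set_x and test_set_y instead of the synthetic arrays.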
The output —
The optimal values of our parameters “W” and “b” —
Discuss the Output
Training accuracy is close to 100%. This is a good sanity check: our model is working and has high enough capacity to fit the training data.
Test accuracy is 70%. That is actually not bad for this simple model, given the small dataset we used and the fact that logistic regression is a linear classifier. But as we keep extending this classifier with more sophisticated algorithms, we will see that we can do a lot better at predicting unseen data.
Also, you see that the model is clearly overfitting the training data. Later we will see how to reduce overfitting, for example by using regularization and other important techniques.
If we plot the learning curve —
You can see the cost decreasing. It shows that the parameters are being learned.
Choice of Learning Rate
In order for Gradient Descent to work you must choose the learning rate wisely. The learning rate α determines how rapidly we update the parameters. If the learning rate is too large we may “overshoot” the optimal value. Similarly, if it is too small we will need too many iterations to converge to the best values. That’s why it is crucial to use a well-tuned learning rate.
Let’s compare the learning curve of our model with several choices of learning rates
learning_rates = [0.01, 0.001, 0.0001]
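The comparison can be run as a simple loop that trains one model per learning rate and keeps the cost history of each. A sketch on made-up one-feature data (not the cat dataset), with a hypothetical helper `train_costs`:

```python
import numpy as np

def train_costs(X, Y, learning_rate, num_iterations=200):
    """Run gradient descent from zero parameters and return the cost at every iteration."""
    m = X.shape[1]
    w, b = np.zeros((X.shape[0], 1)), 0.0
    costs = []
    for _ in range(num_iterations):
        A = 1.0 / (1.0 + np.exp(-(np.dot(w.T, X) + b)))
        costs.append(float(-np.mean(Y * np.log(A) + (1 - Y) * np.log(1 - A))))
        w = w - learning_rate * np.dot(X, (A - Y).T) / m
        b = b - learning_rate * np.sum(A - Y) / m
    return costs

# Made-up data: negative values are class 0, positive values are class 1.
X_toy = np.array([[-2.0, -1.0, 1.0, 2.0]])
Y_toy = np.array([[0, 0, 1, 1]])

learning_rates = [0.01, 0.001, 0.0001]
curves = {lr: train_costs(X_toy, Y_toy, lr) for lr in learning_rates}
for lr in learning_rates:
    print("learning rate:", lr, "| final cost:", round(curves[lr][-1], 4))
```

Each entry of `curves` is one learning curve; passing it to plt.plot with the learning rate as the label draws the curves on a single chart.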
So, what we can see here is –
- Different learning rates give different costs and thus different prediction results.
- If the learning rate is too large, the cost may oscillate up and down. It may even diverge.
- A lower cost doesn’t mean a better model. You have to check if there is possibly overfitting. It happens when the training accuracy is a lot higher than the test accuracy.
- In deep learning, it is usually recommended to choose the learning rate that better minimizes the cost function and if your model overfits, use other techniques to reduce overfitting.
Try your own images –
We have pretty much solved the entire thing from scratch. There is almost no abstraction, which I think is very important when getting started with any new concept.
One last thing we might want to do is upload some of our own cat/non-cat pictures and see how they are predicted. That will give us a complete sense of freedom.
Here’s the code snippet for trying your own uploaded images, which are then converted to an array.
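As a rough sketch of the preprocessing involved: an uploaded picture has to be resized to 64 x 64, flattened into a column and scaled by 255, exactly like the training images. The version below uses PIL for the resizing, which may differ from the original snippet; `image_to_vector` is a hypothetical helper, and a solid-colour image stands in for a real upload:

```python
import numpy as np
from PIL import Image

def image_to_vector(img, num_px=64):
    """Resize a PIL image to (num_px, num_px), flatten it and scale it to [0, 1]."""
    resized = img.convert("RGB").resize((num_px, num_px))
    pixels = np.asarray(resized, dtype=np.float64)   # shape (num_px, num_px, 3)
    return pixels.reshape(1, -1).T / 255.0           # shape (num_px * num_px * 3, 1)

# Stand-in for an uploaded picture; for a real file, open it with Image.open(<your file path>).
demo = Image.new("RGB", (120, 80), color=(255, 0, 0))
x = image_to_vector(demo)
print(x.shape)
```

The resulting column vector has the same layout as one column of train_set_x, so it can be passed to the prediction step together with the learned W and b.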
The entire source code of this solution is here, please make use of it –
In this article we have taken some very important steps, and I am quite happy and proud to call them out –
- Firstly, we took a very interesting and difficult problem and solved it from start to end.
- The implementation is from scratch, without using any machine learning libraries or abstractions, which helps us see things the way they are.
- The way we have built this algorithm opens up a whole new class of algorithms called Neural Networks (Deep Learning), which we will see subsequently.
- The way I see this algorithm, it is a baseline for solving a whole set of problems in Computer Vision. We can make it more accurate by using techniques like regularization, deeper networks, convolutional neural networks, hyperparameter tuning, optimization techniques other than plain Gradient Descent, libraries like TensorFlow, and many more.
My objective for this article was to solve this interesting problem fearlessly, without worrying too much up front about the theory and the maths behind it, and then to learn the theory as we encountered it. I think I have managed to do that, and I hope the readers feel the same and come away enriched.
In subsequent articles I will extend this algorithm to make it more accurate and better.
- Deep Learning Specialization by Andrew Ng & team.