Understanding CNN (Convolutional Neural Network)

Source: Deep Learning on Medium

Consider this image. Do you see a young lady or a grandma? If you focus on the dot in the middle of the image, you see a young lady. If you focus instead on the black strip at the bottom middle of the image, you see an old lady. Look at the red boxes on the image.

This picture offers an insight into how humans recognize images. Because the human brain classifies objects by picking out patterns, changing the point you focus on also changes your interpretation of the whole image.

Similar to how the human brain works, a CNN picks out meaningful features in an image in order to classify the image as a whole.

Principles of CNN


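A convolution slides a small window across the image and, at each position, computes the dot product between the pixel values inside the window and the values of a filter (also called a kernel). The result is large wherever the input matches the filter's pattern, which is how a convolution emphasizes the relevant features.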

1D Convolution Operation with features (filter)

Look at this input. We place a small window over the first few input elements, take the dot product of those elements with the filter elements, and save the result. Sliding the window one step at a time and repeating the operation yields 5 output elements: [0,0,0,1,0]. The single 1 tells us that the feature the filter looks for (a 1 followed by a 0) occurs at position 4, so the filter has located the pattern in the input. The same idea carries over to 2D convolutions.
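To make the sliding window concrete, here is a minimal sketch in plain Python. The input and filter values are my own assumption, chosen only so that the result matches the [0,0,0,1,0] output described above; the figure's actual values may differ. Like most deep learning libraries, the sketch does not flip the filter before multiplying.

```python
def conv1d(signal, kernel):
    """Slide `kernel` over `signal` and return the dot product at each position."""
    k = len(kernel)
    return [
        sum(signal[i + j] * kernel[j] for j in range(k))
        for i in range(len(signal) - k + 1)
    ]

# Hypothetical values (not taken from the figure).
signal = [1, 1, 1, 1, 0, 0]
kernel = [1, -1]  # responds with 1 when a 1 is followed by a 0

print(conv1d(signal, kernel))  # [0, 0, 0, 1, 0]
```

A 6-element input and a 2-element filter give 6 - 2 + 1 = 5 outputs, one per window position.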

2D Convolution Operation with features (filter) (Source)

With this computation, you detect a particular feature in the input image and produce a feature map (convolved feature) that emphasizes where that feature occurs. The filter values are not fixed: gradient descent updates them during training to minimize the prediction loss, so the feature maps change as the filters are learned.
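The 2D case can be sketched the same way. This is an illustration in plain Python, not how a real framework implements it, and the image and filter values below are made up: the filter is a simple vertical-edge detector.

```python
def conv2d(image, kernel):
    """Slide a 2D kernel over a 2D image (no padding, stride 1)."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    return [
        [
            sum(image[r + i][c + j] * kernel[i][j]
                for i in range(kh) for j in range(kw))
            for c in range(out_w)
        ]
        for r in range(out_h)
    ]

# Hypothetical 4x4 image: bright left half, dark right half.
image = [
    [1, 1, 0, 0],
    [1, 1, 0, 0],
    [1, 1, 0, 0],
    [1, 1, 0, 0],
]
# Hypothetical filter that fires where brightness drops from left to right.
kernel = [
    [1, -1],
    [1, -1],
]

for row in conv2d(image, kernel):
    print(row)  # each of the 3 rows is [0, 2, 0]
```

The feature map is nonzero only in the middle column, which is exactly where the edge sits in the image.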

Furthermore, the more filters you deploy, the more features the CNN can extract, at the cost of more training time. The same trade-off applies to depth: there is a sweet spot for the number of layers, and I usually use 6 for 150 x 150 images.

Feature map in each layer of CNN (source)

However, what about the pixels at the corners and edges? They do not have enough neighbors to fill the filter window. Should we remove them?

No, because you would lose important information. What you want to do instead is padding: you surround the input with a border of zeros. With those zeros as neighbors, the filter fits over the edge pixels too, and you no longer need to exclude them.
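Here is a small sketch of zero padding in plain Python. The helper names `zero_pad` and `output_size` are my own, and the size formula assumes a stride of 1.

```python
def zero_pad(image, pad):
    """Surround a 2D image with `pad` rows and columns of zeros."""
    width = len(image[0]) + 2 * pad
    padded = [[0] * width for _ in range(pad)]
    for row in image:
        padded.append([0] * pad + list(row) + [0] * pad)
    padded.extend([0] * width for _ in range(pad))
    return padded

def output_size(n, k, pad):
    """Output length along one axis for input n, kernel k, stride 1."""
    return n + 2 * pad - k + 1

for row in zero_pad([[1, 2], [3, 4]], 1):
    print(row)
# [0, 0, 0, 0]
# [0, 1, 2, 0]
# [0, 3, 4, 0]
# [0, 0, 0, 0]

print(output_size(5, 3, 0))  # 3: without padding the map shrinks
print(output_size(5, 3, 1))  # 5: one ring of zeros keeps the size
```

With a 3 x 3 filter, a padding of 1 is enough for the output to keep the same size as the input.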

Essentially, these convolution layers rely on weight sharing: the same kernel weights are applied at every position, so the network builds up visual context from neighboring pixels to classify images. In a fully connected Neural Network (NN), every input pixel has its own independent weight. In a CNN, one small set of weights is reused across the whole image, so a feature can be extracted wherever it appears.

Max Pooling

We take the maximum of each 2×2 area of the feature map (source)

A CNN uses max pooling to replace each small region of a feature map with its maximum value, which reduces data size and processing time. This keeps the features that produce the strongest response and reduces the risk of overfitting.

Max pooling takes two hyperparameters: stride and size. The size determines how large each pooling window is, while the stride determines how many positions the window moves between one pool and the next.
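As a rough sketch, max pooling with these two hyperparameters could look like this in plain Python. The feature map values are made up for illustration.

```python
def max_pool2d(feature_map, size=2, stride=2):
    """Keep the maximum of each size x size window, moving `stride` cells at a time."""
    rows = range(0, len(feature_map) - size + 1, stride)
    cols = range(0, len(feature_map[0]) - size + 1, stride)
    return [
        [
            max(feature_map[r + i][c + j]
                for i in range(size) for j in range(size))
            for c in cols
        ]
        for r in rows
    ]

# Hypothetical 4x4 feature map.
feature_map = [
    [1, 3, 2, 1],
    [4, 6, 5, 0],
    [7, 2, 9, 8],
    [1, 0, 3, 4],
]

print(max_pool2d(feature_map, size=2, stride=2))  # [[6, 5], [7, 9]]
print(max_pool2d(feature_map, size=2, stride=1))  # [[6, 6, 5], [7, 9, 9], [7, 9, 9]]
```

With a stride equal to the size, the windows do not overlap and a 4 x 4 map shrinks to 2 x 2. With a stride of 1, the windows overlap and the output only shrinks to 3 x 3.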

Activation Function (ReLU and Sigmoid)

After each convolutional and max pooling operation, we can apply a Rectified Linear Unit (ReLU). The ReLU function mimics a neuron that only fires on a “big enough stimulus”: it returns x unchanged when x>0 and returns 0 otherwise, which introduces nonlinearity. It has been effective at mitigating vanishing gradients, because its gradient is 1 for every positive input instead of shrinking toward 0 the way a saturated sigmoid's does. Activations that are negative become 0 after the ReLU activation function.
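A short sketch of both activation functions and their gradients, in plain Python. The input values are arbitrary. The comparison is only meant to illustrate why ReLU suffers less from vanishing gradients than sigmoid.

```python
import math

def relu(x):
    """Return x for positive inputs and 0 otherwise."""
    return max(0.0, x)

def relu_grad(x):
    """Gradient of ReLU: 1 for positive inputs, 0 otherwise."""
    return 1.0 if x > 0 else 0.0

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def sigmoid_grad(x):
    """Gradient of sigmoid: s * (1 - s), which is never larger than 0.25."""
    s = sigmoid(x)
    return s * (1.0 - s)

inputs = [-2.0, -0.5, 0.0, 0.5, 2.0]
print([relu(x) for x in inputs])       # [0.0, 0.0, 0.0, 0.5, 2.0]
print([relu_grad(x) for x in inputs])  # [0.0, 0.0, 0.0, 1.0, 1.0]
print(sigmoid_grad(0.0))               # 0.25
print(sigmoid_grad(10.0) < 1e-4)       # True: the sigmoid has saturated
```

When many layers are chained, gradients get multiplied together. Factors of at most 0.25 shrink quickly, while factors of 1 do not.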

The CNN Big Picture + Fully Connected Layer

CNN architectures with convolutions, pooling (subsampling), and fully connected layers for softmax activation function

Finally, we feed the feature maps produced by the convolution and max pooling layers into a Fully Connected Layer (FCL). We flatten the feature maps into a column vector and feed it forward through the FCL. The last layer has one node per label, and every node in the previous layer is connected to each of these output nodes. We then apply the softmax activation function, which assigns a decimal probability to each possible label so that the probabilities add up to 1.0.
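The flatten, fully connected, and softmax steps can be sketched in a few lines of plain Python. The feature map, weights, and biases below are invented numbers for a two-label problem (say, dog versus cat), not values from a trained model.

```python
import math

def flatten(feature_maps):
    """Turn a stack of 2D feature maps into one flat vector."""
    return [v for fmap in feature_maps for row in fmap for v in row]

def dense(inputs, weights, biases):
    """Fully connected layer: every input feeds every output node."""
    return [
        sum(w * x for w, x in zip(row, inputs)) + b
        for row, b in zip(weights, biases)
    ]

def softmax(logits):
    """Convert raw scores into probabilities that sum to 1."""
    m = max(logits)  # subtracting the max avoids overflow in exp
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical values: one 2x2 feature map, two output labels.
feature_maps = [[[1.0, 0.0], [0.0, 1.0]]]
weights = [
    [0.5, -0.2, 0.1, 0.5],   # weights into the "dog" node
    [-0.5, 0.3, 0.2, -0.5],  # weights into the "cat" node
]
biases = [0.0, 0.0]

logits = dense(flatten(feature_maps), weights, biases)
probs = softmax(logits)
print(logits)      # [1.0, -1.0]
print(sum(probs))  # 1.0, up to floating point rounding
```

The label with the highest probability is the prediction; here that is the first node.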

The end result? You will be able to classify dog and cat images as below.

Finding the perfect image classification with softmax (Source)

Cleaning and Preventing Overfitting in CNN

Unfortunately, CNN is not immune to overfitting. If training is not monitored properly, the model can be trained so much that it fails to generalize to unseen data. I have made many beginner overfitting mistakes myself. Here are the most common ones and how I resolved them:

Using test set as the validation set to test the model

Even though we do not train the model on the test set, every time we use test set results to pick hyperparameters or decide when to stop training, we tune the model toward that data. The test score then stops being an honest estimate of performance on unseen data, and this is a common cause of overfitting going unnoticed. Therefore, during training we should evaluate on a separate validation set, and only test the finished model on the unseen test set.
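A minimal sketch of such a three-way split in plain Python. The 70/15/15 proportions and the function name are my own assumptions, not a rule.

```python
import random

def train_val_test_split(items, val_frac=0.15, test_frac=0.15, seed=0):
    """Shuffle once, then carve out disjoint test, validation, and training sets."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    n_test = round(len(shuffled) * test_frac)
    n_val = round(len(shuffled) * val_frac)
    test = shuffled[:n_test]
    val = shuffled[n_test:n_test + n_val]
    train = shuffled[n_test + n_val:]
    return train, val, test

# Hypothetical dataset: 100 image ids.
train, val, test = train_val_test_split(range(100))
print(len(train), len(val), len(test))  # 70 15 15
```

You would tune the model against `val` as often as you like and touch `test` only once, at the very end.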

Dataset is relatively small

When the dataset is small, it is very easy for the model to specialize on a few rules and fail to generalize. For example, if your model only ever sees boots as shoes, then the next time you show it high heels, it will not recognize them as shoes.

Therefore, with a small training set, you need to artificially boost the diversity and number of training examples. One way of doing this is image augmentation: creating new variants of existing images by translating them and applying changes such as zooms, crops, and flips.
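A few of these augmentations are simple enough to sketch in plain Python on a tiny made-up image. In practice you would use the augmentation utilities of your deep learning library, and the function names here are my own.

```python
def flip_horizontal(image):
    """Mirror the image left to right."""
    return [row[::-1] for row in image]

def flip_vertical(image):
    """Mirror the image top to bottom."""
    return [row[:] for row in image[::-1]]

def translate_right(image, shift, fill=0):
    """Shift pixels to the right, filling the vacated columns with `fill`."""
    return [[fill] * shift + row[:len(row) - shift] for row in image]

def crop(image, top, left, height, width):
    """Cut out a height x width patch starting at (top, left)."""
    return [row[left:left + width] for row in image[top:top + height]]

# Hypothetical 3x3 grayscale image.
image = [
    [1, 2, 3],
    [4, 5, 6],
    [7, 8, 9],
]

print(flip_horizontal(image))     # [[3, 2, 1], [6, 5, 4], [9, 8, 7]]
print(translate_right(image, 1))  # [[0, 1, 2], [0, 4, 5], [0, 7, 8]]
print(crop(image, 0, 1, 2, 2))    # [[2, 3], [5, 6]]
```

Each variant keeps the original label, so one labeled image turns into several training examples.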

Image augmentation Source

Over Memorization

Too many neurons, layers, and training epochs promote memorization and inhibit generalization. The longer you train your model, the more likely it is to become too specialized. To counter this, you could reduce the complexity by removing a few hidden layers or reducing the number of neurons per layer.

Alternatively, you could use regularization techniques such as Dropout, which deactivates a random subset of activation units at every gradient step. Each training step therefore deactivates a different set of neurons.

Since the number of gradient steps is usually high, every neuron is dropped about equally often on average. Intuitively, the more you drop out, the less likely your model is to memorize.
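Here is a sketch of dropout on a list of activations in plain Python. It uses the common "inverted dropout" variant, which is an assumption on my part rather than something this article covers: the surviving activations are scaled up by 1 / (1 - rate), so nothing needs rescaling at prediction time.

```python
import random

def dropout(activations, rate, rng, training=True):
    """Zero each activation with probability `rate` and rescale the survivors."""
    if not training or rate == 0.0:
        return list(activations)  # dropout is switched off at prediction time
    keep = 1.0 - rate
    return [a / keep if rng.random() < keep else 0.0 for a in activations]

rng = random.Random(0)  # fixed seed so the sketch is repeatable
activations = [1.0] * 1000

dropped = dropout(activations, rate=0.5, rng=rng)
print(dropped.count(0.0))  # roughly half of the 1000 units
print(dropout(activations, rate=0.5, rng=rng, training=False) == activations)  # True
```

Calling `dropout` again draws a fresh random mask, which is why a different set of neurons is deactivated at each gradient step.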

Drop out images

Dealing with color images

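You can also easily handle images with 3 color channels: Red, Green, and Blue (RGB). Each filter then becomes a stack of 3 kernels, one per color channel. During convolution, each kernel is applied to its own channel and the three results are summed into a single 2D feature map. Stacking the maps from all the filters in a layer gives you a 3D volume of feature maps.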
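A sketch of a convolution over an RGB image in plain Python, assuming a channels-first layout (`image[channel][row][col]`). The pixel and filter values are made up.

```python
def conv2d_multichannel(image, kernel):
    """Apply one multi-channel filter and sum over channels into one 2D map."""
    channels = len(image)
    kh, kw = len(kernel[0]), len(kernel[0][0])
    out_h = len(image[0]) - kh + 1
    out_w = len(image[0][0]) - kw + 1
    return [
        [
            sum(image[ch][r + i][c + j] * kernel[ch][i][j]
                for ch in range(channels)
                for i in range(kh) for j in range(kw))
            for c in range(out_w)
        ]
        for r in range(out_h)
    ]

# Hypothetical 2x2 RGB image: constant red=1, green=2, blue=3.
image = [
    [[1, 1], [1, 1]],
    [[2, 2], [2, 2]],
    [[3, 3], [3, 3]],
]
# Two hypothetical filters, each with one 2x2 kernel per color channel.
filters = [
    [[[1, 1], [1, 1]], [[0, 0], [0, 0]], [[-1, -1], [-1, -1]]],  # red minus blue
    [[[1, 1], [1, 1]], [[1, 1], [1, 1]], [[1, 1], [1, 1]]],      # overall brightness
]

volume = [conv2d_multichannel(image, f) for f in filters]
print(volume)  # [[[-8]], [[24]]]
```

The depth of the output volume equals the number of filters (2 here), not the number of color channels.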

How could we do better? — Transfer Learning

As use cases become more complex, the model needs to become more complex as well. With a few layers of CNN, you can pick up features simple enough to classify dogs and cats. For harder problems, however, you may want to classify more complex objects and use far more data. Rather than training such a model yourself, transfer learning allows you to leverage existing models to get a classifier quickly.

Transfer learning is a technique that reuses an existing model as part of a new one. You can build on top of existing models that were carefully designed by experts and trained on millions of pictures.

However, there are a few caveats that you need to follow. First, you need to replace the final layer so that it matches your number of possible classes. Second, you need to freeze the pretrained parameters, that is, mark the trained model's variables as non-trainable. This prevents training from overwriting the features the model has already learned.

One well-known pretrained model that you could use for transfer learning is MobileNet. It was created for mobile devices, which have less memory and fewer computational resources. You can find MobileNet on TensorFlow Hub, which gathers many pretrained models. You can simply add your own FCL on top of these models.
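To illustrate the two caveats (a new final layer, a frozen base) without downloading anything, here is a toy sketch in plain Python. Everything in it is a stand-in: `FROZEN_BASE` holds made-up numbers playing the role of a pretrained model such as MobileNet, and the "new final layer" is a single logistic unit. It shows the mechanics only and is not how you would call TensorFlow Hub.

```python
import math

# Stand-in for a pretrained base: made-up weights, NOT a real MobileNet.
FROZEN_BASE = [[1.0, -1.0], [0.5, 0.5]]

def base_features(x):
    """Frozen feature extractor: a linear layer followed by ReLU."""
    return [max(0.0, sum(w * v for w, v in zip(row, x))) for row in FROZEN_BASE]

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def train_head(samples, labels, epochs=200, lr=0.5):
    """Train only the new final layer on top of the frozen features."""
    weights, bias = [0.0, 0.0], 0.0
    for _ in range(epochs):
        for x, y in zip(samples, labels):
            feats = base_features(x)  # the base weights are never updated
            pred = sigmoid(sum(w * f for w, f in zip(weights, feats)) + bias)
            error = pred - y          # gradient of the log loss w.r.t. the logit
            weights = [w - lr * error * f for w, f in zip(weights, feats)]
            bias -= lr * error
    return weights, bias

def predict(x, weights, bias):
    feats = base_features(x)
    return sigmoid(sum(w * f for w, f in zip(weights, feats)) + bias)

# Hypothetical two-class dataset.
samples = [[2.0, 0.0], [3.0, 1.0], [0.0, 2.0], [1.0, 3.0]]
labels = [1, 1, 0, 0]

weights, bias = train_head(samples, labels)
print(predict([2.0, 0.0], weights, bias) > 0.5)  # True
print(predict([0.0, 2.0], weights, bias) < 0.5)  # True
```

Only `weights` and `bias` change during training. With a real pretrained model the idea is the same: mark the base as non-trainable and fit only the layers you add.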

Conclusion: CNN to perceive our visual world

CNN is a tough subject but a rewarding technique to learn. It teaches us about how we perceive images and opens up useful applications for classifying images and videos. After learning CNN, I realized that I could use this for my project at Google to detect phishing attacks.

I also realized that the body of knowledge around CNN is very deep. Over the years, there have been many improvements in CNN variants, including one of the more recent ones, ResNet, which was reported to surpass human-level accuracy on ImageNet classification.

  1. LeNet (Yann LeCun, 1998)
  2. AlexNet (2012)
  3. VGGNet (2014): deep neural network
  4. GoogLeNet with Inception modules (2014): stacked module layers
  5. ResNet (2015): reported to surpass human-level accuracy on ImageNet

I am writing this article to explore my basic understanding of CNN for a project I work on at Google. Therefore, feel free to give me feedback if you spot any mistakes or knowledge gaps in my writing. Soli Deo Gloria.


I sincerely hope this piques your interest in learning more about CNN. If it does, here are some resources which you might find very useful:

  1. Yann LeCun’s paper on CNN
  2. CS 231 Stanford
  3. Google ML CNN
  4. And many others


I really hope this has been a great read and a source of inspiration for you to develop and innovate.

Please comment below with suggestions and feedback. Just like you, I am still learning how to become a better Data Scientist and Engineer. Please help me improve so that I can help you better in my subsequent articles.

Thank you and Happy coding 🙂

About the Author

Vincent Tatan is a Data and Technology enthusiast with relevant working experiences from Google LLC, Visa Inc. and Lazada to implement microservice architectures, business intelligence, and analytics pipeline projects.

Vincent is a native Indonesian with a record of accomplishments in problem-solving with strengths in Full Stack Development, Data Analytics, and Strategic Planning.

He has been actively consulting SMU BI & Analytics Club, guiding aspiring data scientists and engineers from various backgrounds, and opening up his expertise for businesses to develop their products.

Vincent also opens up his 1 on 1 mentorship service on 10to8 to coach how you can land your dream Data Scientist/Engineer Job at Google, Visa or other large tech companies.
