Image Processing using OpenCV, CNN, and Keras backed by TensorFlow

Original article was published by Chetan Mehta on Deep Learning on Medium


There are various applications of Image processing in computer vision.
Image processing involves manipulating digital images in order to extract additional information. Computer hardware has evolved considerably in the past decade, giving us faster processors and GPUs, which has enabled us to solve new and emerging problems using image processing.

Its applications range from medicine to entertainment, including geological processing and remote sensing. Multimedia systems, one of the pillars of the modern information society, rely heavily on digital image processing.

In this article, we will try to solve a simpler problem using Image processing.

Problem Statement

We will try to solve a classical classification problem. Let’s assume we are dealing with the garment industry. After each production unit, we need to validate if the units are ready to sell, which in turn involves identifying any defects in the produced clothes.

Let’s try to identify two types of defects in clothes:

  1. Torn clothes
  2. Dirty clothes

After each production, the idea is to pass the images of the produced clothes to a system which will categorize if it contains one of the mentioned defects.


A digital image is a 2-D matrix of pixels of different values.

All images consist of pixels which are the raw building blocks of images. Images are made of pixels in a grid. A 640 x 480 image has 640 columns (the width) and 480 rows (the height). There are 640 * 480 = 307200 pixels in an image with those dimensions.
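This is easy to see in code: an image loaded with OpenCV is just a NumPy array. A small sketch, using a blank dummy array in place of a real cv2.imread call:

```python
import numpy as np

# A 640 x 480 RGB image as OpenCV would load it:
# the array shape is (height, width, channels) = (480, 640, 3).
image = np.zeros((480, 640, 3), dtype=np.uint8)

height, width = image.shape[:2]
print(width * height)  # 307200 pixels, matching 640 * 480
```

Note that NumPy puts the height (rows) first, so a "640 x 480" image has shape (480, 640, 3).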

Essentially, image processing involves the following basic steps:

  1. Importing image using Image acquisition tools.
  2. Image Pre-processing / Analysing and manipulating images.
  3. Output in which either you can alter an image or make some analysis out of it.

We are going to use the OpenCV library for all the image pre-processing tasks. OpenCV reads data from contiguous memory locations; for that reason, we are going to use the HDF5 format for reading and writing the image data.

We will briefly touch upon all the needed tools/ libraries for better understanding.

HDF5(Hierarchical Data Formats)

The HDF5 format can be thought of as a file system contained and described within one single file. Think of the files and folders stored on your computer. In an HDF5 file, what we call “directories” or “folders” on our computers are called groups, and what we call files are called datasets.

For our use-case, we will store all the images in the HDF5 format organizing them into different folders based on the type and category that the image belongs to.
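As a sketch of that layout, here is how groups and datasets can be created with the h5py library (the file name, group names, and dummy image data below are illustrative, not the article's exact files):

```python
import os
import tempfile

import numpy as np
import h5py  # pip install h5py

# Two groups ("folders") for our two defect categories, each holding
# one dataset ("file") of image data.
path = os.path.join(tempfile.gettempdir(), "clothes.h5")
with h5py.File(path, "w") as f:
    hole_group = f.create_group("hole")
    dirt_group = f.create_group("dirt")
    # Store a batch of 10 dummy 64x64 RGB images per group.
    hole_group.create_dataset("images", data=np.zeros((10, 64, 64, 3), dtype=np.uint8))
    dirt_group.create_dataset("images", data=np.ones((10, 64, 64, 3), dtype=np.uint8))

with h5py.File(path, "r") as f:
    print(list(f.keys()))          # ['dirt', 'hole']
    print(f["hole/images"].shape)  # (10, 64, 64, 3)
```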

OpenCV(Open Source Computer Vision) Library

It is an open-source computer vision and machine learning software library. OpenCV was built to provide a common infrastructure for computer vision applications and to accelerate the use of machine perception in commercial products.

We will use the OpenCV library to resize the images and create feature vectors from them, which is achieved by converting the image data to NumPy arrays.

We will use one of the extensions of Deep Neural Nets named CNN(Convolutional Neural Network) for training the model.

CNN (Convolutional Neural Network)

One of the important aspects of solving any problem using machine learning is extracting features from the entity set. In the case of image processing, the feature set is essentially the pixels that the image is constructed of.

The feature set depends upon the resolution and size of the image.

The number of pixels in One Megabyte depends on the color mode of the picture.

  • In an 8-bit (256 colors) picture, one megabyte contains 1,048,576 (1024 × 1024) pixels.
  • In a 16-bit (65,536 colors) picture, one megabyte contains 524,288 (1024 × 512) pixels.
  • In a 24-bit RGB (16.7 million colors) picture, one megabyte contains approximately 349,920 (486 × 720) pixels.
  • In a 32-bit CMYK (16.7 million colors) picture, one megabyte contains 262,144 (512 × 512) pixels.
  • In a 48-bit picture, one megabyte contains only approximately 174,960 (486 × 360) pixels.

A CNN works on the simple assumption that not all pixels are required to identify features in an image.

For the classification problem, we are going to identify the boundaries/edges within images to classify them in one of the mentioned categories. So basically we are going to solve an Edge detection problem using CNN.

Convolutional layers are the major building blocks used in convolutional neural networks.

Convolution is the simple application of a filter to an input that results in an activation. Repeated application of the same filter to an input results in a map of activations called a feature map, indicating the locations and strength of a detected feature in input, such as an image.

The innovation of convolutional neural networks is the ability to automatically learn a large number of filters in parallel specific to a training dataset under the constraints of a specific predictive modeling problem, such as image classification. The result is highly specific features that can be detected anywhere on input images.

To understand CNNs we need to understand how convolutions work. Let’s take an image represented as a 5×5 matrix of values, with each cell representing a single pixel. We can then take a 3×3 matrix and slide it as a window across the image. At each position the 3×3 window visits, we multiply it element-wise with the image values beneath it and sum the results.
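The sliding-window computation just described can be sketched in NumPy (stride 1, no padding; strictly speaking this is cross-correlation, which is what deep learning libraries implement under the name "convolution"):

```python
import numpy as np

def convolve2d(image, kernel):
    """Slide `kernel` over `image` (stride 1, no padding); at each
    position multiply element-wise and sum, producing a feature map."""
    kh, kw = kernel.shape
    out_h = image.shape[0] - kh + 1
    out_w = image.shape[1] - kw + 1
    out = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out

image = np.arange(25).reshape(5, 5)  # the 5x5 image from the text
kernel = np.ones((3, 3))             # a simple 3x3 filter
feature_map = convolve2d(image, kernel)
print(feature_map.shape)  # (3, 3): a 3x3 window fits 3 positions each way
```

Note how a 5×5 input and a 3×3 kernel produce a 3×3 feature map: the window can only occupy 3 positions along each axis.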

In a nutshell, convolution works as follows.


The window that moves over the image is called the kernel.
The distance that the window moves at each step is called the stride.

The goal of a convolutional layer is filtering. As the kernel moves over an image, it effectively checks for patterns in that section of the image. This works because of filters: stacks of weights, represented as vectors, that are multiplied with the values produced by the convolution.

A typical architecture of CNN involves the following components.


Pooling works similarly to convolution; the difference is that the function applied to the kernel window isn’t linear.

The most common pooling functions are Max pooling and Average pooling. Max pooling takes the max value from the window, while average pooling takes the average of all the values in the window.
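Both pooling variants can be sketched with the same sliding-window loop as convolution (a window size and stride of 2 are typical choices):

```python
import numpy as np

def pool2d(image, size=2, stride=2, func=np.max):
    """Apply `func` (e.g. np.max or np.mean) to each size x size window."""
    out_h = (image.shape[0] - size) // stride + 1
    out_w = (image.shape[1] - size) // stride + 1
    out = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            window = image[i * stride:i * stride + size,
                           j * stride:j * stride + size]
            out[i, j] = func(window)
    return out

x = np.array([[1, 2, 5, 6],
              [3, 4, 7, 8],
              [9, 7, 1, 0],
              [5, 3, 2, 4]])
print(pool2d(x, func=np.max))   # max pooling: [[4. 8.] [9. 4.]]
print(pool2d(x, func=np.mean))  # average pooling: [[2.5 6.5] [6. 1.75]]
```

Either way, a 4×4 input is reduced to a 2×2 output, which is exactly why pooling is used: it shrinks the feature maps while keeping the strongest (or average) responses.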

ReLU is an activation function that outputs max(0, x): negative values become zero, while positive values pass through unchanged. (Squashing activations such as sigmoid and tanh, by contrast, map values into a range like [0, 1] or [-1, 1].)

Softmax is a function that normalizes a vector of inputs into a discrete probability distribution, which lets us read the network’s outputs as class probabilities.
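Both functions are straightforward to express in NumPy (a sketch; the max-subtraction in softmax is a standard trick for numerical stability):

```python
import numpy as np

def relu(x):
    # Passes positive values through unchanged, zeroes out negatives.
    return np.maximum(0, x)

def softmax(x):
    # Subtract the max for numerical stability, then normalize so the
    # outputs are positive and sum to 1.
    e = np.exp(x - np.max(x))
    return e / e.sum()

scores = np.array([-1.0, 0.0, 2.0])
print(relu(scores))     # [0. 0. 2.]
print(softmax(scores))  # probabilities summing to 1, largest for 2.0
```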


Prerequisites

A basic understanding of the following is assumed:

  1. Deep Neural Network
  2. Python
  3. Any ML library, preferably TensorFlow
  4. CNN


The implementation will constitute the following steps:

  1. Collect training data.
  2. Label the data and store it in an HDF5 file format.
  3. Train the model using CNN.

Collect training data

We will use Google image search to find the images we are looking for. Let’s write a JavaScript function that collects the links of the search results.

Search for the images in Google image search, then run the following script in the JavaScript console of your browser. It will store the links of all the images in a text file named urls.txt.
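The article’s original script is not reproduced here; the following is a minimal sketch of the same idea. The img selector and the http filter are assumptions about the page markup and may need adjusting for the current Google Images layout:

```javascript
// Join collected URLs into the text that will be saved as urls.txt.
function buildUrlList(urls) {
  return urls.join('\n');
}

// In the browser console: gather image URLs and trigger a download.
if (typeof document !== 'undefined') {
  const urls = Array.from(document.querySelectorAll('img'))
    .map(img => img.src)
    .filter(src => src.startsWith('http')); // skip base64 data URIs
  const blob = new Blob([buildUrlList(urls)], { type: 'text/plain' });
  const link = document.createElement('a');
  link.href = URL.createObjectURL(blob);
  link.download = 'urls.txt';
  link.click();
}
```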

Execute the following Python script to save all the images whose links are collected in urls.txt to a local drive. We will use Python’s requests module to store the images in a directory.

Run the script with the following command-line options: --urls=<PATH_TO_URL_FILE> --output=<PATH_TO_OUTPUT_DIRECTORY>.

Once we have collected the training data, the next step is to pre-process it.

Label the data and store it in an HDF5 file format

The pre-processing involves the following steps:

  1. Labeling the data: assuming images of the same category are stored in the same folder, we can have two folders, say hole and dirt. Label the images as hole=0 and dirt=1.
  2. Create train_set, test_set, and cross_validation_set from the dataset.
  3. Compress the data, shuffle it, and store it in a batch file using OpenCV and HDF5.
  4. Compute the training mean, subtract it from each image, and create one-hot encodings.

The following script executes steps 1 to 3 and, as a result, creates an HDF5 file from the training data.
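The article’s script itself is not included here; a sketch of steps 1 to 3 might look like the following. Random arrays stand in for images that a real run would load with cv2.imread from the hole and dirt folders, and the 80/20 split ratio is an illustrative choice (a cross-validation split would follow the same pattern):

```python
import os
import tempfile

import numpy as np
import h5py  # pip install h5py

def build_dataset(images_per_class=10, size=64):
    """Label dummy images (hole=0, dirt=1), shuffle, and split them."""
    images, labels = [], []
    for label, _folder in [(0, "hole"), (1, "dirt")]:
        for _ in range(images_per_class):
            # Stand-in for cv2.imread + cv2.resize on files in _folder.
            images.append(np.random.randint(0, 256, (size, size, 3), dtype=np.uint8))
            labels.append(label)
    images, labels = np.array(images), np.array(labels)

    # Shuffle, then split 80/20 into train and test sets.
    idx = np.random.permutation(len(images))
    images, labels = images[idx], labels[idx]
    n_train = int(0.8 * len(images))
    return (images[:n_train], labels[:n_train],
            images[n_train:], labels[n_train:])

train_x, train_y, test_x, test_y = build_dataset()
path = os.path.join(tempfile.gettempdir(), "dataset.h5")
with h5py.File(path, "w") as f:
    # gzip compression covers the "compress the data" part of step 3.
    f.create_dataset("train_img", data=train_x, compression="gzip")
    f.create_dataset("train_labels", data=train_y)
    f.create_dataset("test_img", data=test_x, compression="gzip")
    f.create_dataset("test_labels", data=test_y)
```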

We will consume the HDF5 file generated by the previous script, subtract the training mean from each data point, and additionally create a one-hot encoded vector for the labels.
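Step 4 reduces to a few lines of NumPy. A sketch with a tiny dummy batch in place of the HDF5 data:

```python
import numpy as np

# Dummy batch: 4 grayscale "images" of 2x2 pixels, labels in {0, 1}.
train_x = np.array([[[1., 2.], [3., 4.]],
                    [[5., 6.], [7., 8.]],
                    [[1., 0.], [0., 1.]],
                    [[2., 2.], [2., 2.]]])
labels = np.array([0, 1, 1, 0])

# Subtract the per-pixel training mean from every image, so each
# pixel position is centered around zero across the training set.
train_mean = train_x.mean(axis=0)
train_x_centered = train_x - train_mean

# One-hot encode the labels: 0 -> [1, 0], 1 -> [0, 1].
num_classes = 2
one_hot = np.eye(num_classes)[labels]
print(one_hot)
```

The same train_mean computed on the training set would also be subtracted from the test images; the test set must not contribute to the mean.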

Train the model using CNN

So far we have generated the training data and brought it into a format that can be fed to a training model.

Finally, let’s train the data using CNN for generating the model.
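The training script is not reproduced here; a minimal Keras model along the lines the article describes (conv/pool blocks, ReLU, and a softmax output for the two classes) might look like this. The layer sizes are illustrative choices, not the article’s exact architecture, and the input shape assumes the 64 × 64 resized images:

```python
from tensorflow import keras
from tensorflow.keras import layers

# A small binary classifier: conv -> pool blocks, then dense + softmax.
model = keras.Sequential([
    keras.Input(shape=(64, 64, 3)),
    layers.Conv2D(32, (3, 3), activation="relu"),
    layers.MaxPooling2D((2, 2)),
    layers.Conv2D(64, (3, 3), activation="relu"),
    layers.MaxPooling2D((2, 2)),
    layers.Flatten(),
    layers.Dense(64, activation="relu"),
    layers.Dense(2, activation="softmax"),  # hole vs. dirt
])
model.compile(optimizer="adam",
              loss="categorical_crossentropy",
              metrics=["accuracy"])

# Train on the (images, one-hot labels) arrays from the HDF5 file, e.g.:
# model.fit(train_x, train_y, epochs=10, batch_size=32, validation_split=0.1)
```

categorical_crossentropy pairs with the one-hot labels created earlier; with integer labels, sparse_categorical_crossentropy would be used instead.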

Thank You

I hope you found this article helpful. Thanks for taking the time to read it.
Happy Coding !!