Simple Image Classification with CNN


1) Data Collection —

You will need an API authentication key from Flickr to access their services through HTTP requests.

You can apply for a non-commercial key on Flickr’s App Garden page.

Once you have the API key and the secret code, you’re all set to access photos from the Flickr website and download them.

Create a FlickrAPI object and plug in your API key as the first argument and your secret code as the second for authentication.
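A minimal sketch of the setup, assuming the flickrapi package is installed (pip install flickrapi); the key and secret values below are placeholders.

```python
from flickrapi import FlickrAPI

# Placeholder credentials; substitute the key and secret issued by Flickr.
FLICKR_KEY = 'your_api_key'
FLICKR_SECRET = 'your_secret_code'

# The default response format (ElementTree) works with the walk() helper used below.
flickr = FlickrAPI(FLICKR_KEY, FLICKR_SECRET)
```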

The next step is to extract the image URLs, which are stored in the ‘url_c’ attribute of each search result. We’ll download about 960 images for each class and later move 10 random images from each class to its respective test folder, i.e. 10 cat images to the test/cat folder and 10 owl images to the test/owl folder.
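Here’s a rough sketch of collecting the URLs, continuing from the flickr object above. The flickr.walk method pages through search results for us; the search terms, sort order and per_page value are assumptions.

```python
def get_urls(search_term, n_images=960):
    """Collect up to n_images photo URLs for a search term; asking for the
    'url_c' extra makes Flickr include a medium-size image URL per result."""
    photos = flickr.walk(text=search_term,
                         extras='url_c',
                         per_page=50,
                         sort='relevance')
    urls = []
    for photo in photos:
        url = photo.get('url_c')
        if url:                      # some photos have no 'url_c'; skip them
            urls.append(url)
        if len(urls) >= n_images:
            break
    return urls

cat_urls = get_urls('cat')
owl_urls = get_urls('owl')
```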

Here we will also use the sleep function from the time module to pause periodically while downloading the photos from Flickr. This step is done to be respectful to Flickr and not monopolize its resources.

It is also necessary because I noticed that after continuously downloading about 494 images, the Flickr service would throw a timed-out error, so I split the task into multiple chunks with a delay of 100 seconds in between.

The urlretrieve function requires you to specify the URL where the image is hosted as well as the complete destination path, including the file name. Since we have a lot of files, I’ve decided to name each image file with a class prefix and a running count.
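A sketch of the chunked download loop; the chunk size of 400 is an assumption, chosen to stay below the point where the timeouts appeared.

```python
import os
from time import sleep
from urllib.request import urlretrieve

def download(urls, folder, prefix, chunk_size=400, delay=100):
    """Download images in chunks, sleeping between chunks so we don't
    monopolize Flickr's servers or trigger timeouts."""
    os.makedirs(folder, exist_ok=True)
    for count, url in enumerate(urls):
        if count > 0 and count % chunk_size == 0:
            sleep(delay)                                   # pause between chunks
        # Each file is named with a class prefix and the running count.
        urlretrieve(url, os.path.join(folder, f'{prefix}_{count}.jpg'))

# cat_urls and owl_urls come from the URL-collection sketch above.
download(cat_urls, 'train/cat', 'cat')
download(owl_urls, 'train/owl', 'owl')
```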

Don’t forget to move 10 images each of cats and owls to their respective folders inside your testing directory, so we can evaluate the CNN’s performance later on.
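One way to do this with the standard library; the train/test folder layout is the one assumed throughout.

```python
import os
import random
import shutil

# Move 10 random images of each class from the training folder to the test folder.
for animal in ('cat', 'owl'):
    os.makedirs(f'test/{animal}', exist_ok=True)
    files = os.listdir(f'train/{animal}')
    for name in random.sample(files, 10):
        shutil.move(f'train/{animal}/{name}', f'test/{animal}/{name}')
```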

2) CNN Architecture —

We will be using the Keras library with the TensorFlow backend to implement our CNN. I’d highly suggest reading the explanation in this tutorial if you are new to convolutional neural networks.

The CNN architecture is significantly different from that of a standard artificial neural network, and if you’re a beginner, it might seem quite complex at first.

Some of the tasks CNNs are typically used for include image classification, image segmentation and other computer vision problems, with applications even in areas like NLP, time series and recommender systems.

It helps to keep in mind the kind of task that our CNN will have to perform and the type of data it has to process, while trying to understand how it works.

Images and photos are structured differently from your normal data stored in CSVs and contain even more information if they are coloured. This means that we will have to first transform photos into arrays of numbers that can be understood by the network.

A greyscale image consists of a single channel of numbers indicating how dark each pixel is. Similarly, a coloured image has 3 channels (R, G, B) of numbers that together convey colour information such as hue and saturation to the network.
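As a quick illustration (the file path is hypothetical), loading the same image in colour and in greyscale shows the difference in channel count.

```python
from tensorflow.keras.preprocessing.image import load_img, img_to_array

rgb = img_to_array(load_img('train/cat/cat_0.jpg'))                            # shape: (height, width, 3)
grey = img_to_array(load_img('train/cat/cat_0.jpg', color_mode='grayscale'))   # shape: (height, width, 1)
print(rgb.shape, grey.shape)
```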

A CNN is made up of hidden layers that are attached to a fully-connected layer which handles the classification decisions based on the pixel information flowing through the previous layers. We’ll define each layer in further detail below.

Our CNN Architecture will consist of multiple layers:

1) Convolutional Layer:

CNN Architecture. Image from Wikipedia.

This layer is the first to extract information from the image. A filter slides over the image a few pixels at a time and applies the convolution operation: at each position, the output is the dot product of the filter values and the pixel values beneath them. This process continues across the rest of the image, and we end up with the entire image convolved into a feature map.

In simple terms, a convolutional layer detects features/information in the image by placing filters on it and moving them across it consecutively.
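To make the operation concrete, here is a minimal NumPy sketch of a single filter convolving a toy image.

```python
import numpy as np

def convolve2d(image, kernel):
    """Slide the kernel over the image one pixel at a time ('valid' padding),
    taking the dot product at each position to build the feature map."""
    kh, kw = kernel.shape
    out_h = image.shape[0] - kh + 1
    out_w = image.shape[1] - kw + 1
    feature_map = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            feature_map[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return feature_map

# A 3x3 vertical-edge filter applied to a toy 5x5 "image".
image = np.arange(25, dtype=float).reshape(5, 5)
kernel = np.array([[1, 0, -1]] * 3, dtype=float)
print(convolve2d(image, kernel))   # 3x3 feature map
```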

2) Max Pooling:

The Pooling layer reduces the size of this feature map to retain only the important information and make the training process easier and quicker. In our CNN, we will be using a Max Pooling layer, which moves over the feature map in a similar fashion to the convolution filter and keeps only the maximum value of the cluster it is focussing on.
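A matching NumPy sketch of 2×2 max pooling with stride 2:

```python
import numpy as np

def max_pool(feature_map, size=2):
    """Non-overlapping max pooling: keep only the largest value in each
    size x size window of the feature map."""
    h, w = feature_map.shape
    trimmed = feature_map[:h - h % size, :w - w % size]
    return trimmed.reshape(h // size, size, w // size, size).max(axis=(1, 3))

fm = np.array([[1, 3, 2, 4],
               [5, 6, 1, 2],
               [7, 2, 9, 0],
               [3, 4, 1, 8]], dtype=float)
print(max_pool(fm))
# [[6. 4.]
#  [7. 9.]]
```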

3) Fully Connected Layer:

The Fully Connected Layer is a standard neural network; in our case, it is composed of 2 layers: one with 64 nodes and a final layer with a single output node for binary classification.

This is a simplified overview of the different layers in our model. The actual model consists of a series of convolutional and max pooling layers connected to a 2-layer neural network.

Here’s the code to build the CNN.

Since our task is to classify images, we will set the loss to ‘binary_crossentropy’ (for 2 classes) & the metric to ‘accuracy’.
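A minimal sketch of such a model; the filter counts, kernel sizes and number of conv/pool blocks are assumptions consistent with the description above.

```python
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Conv2D, MaxPooling2D, Flatten, Dense

model = Sequential([
    # Convolution + max pooling blocks extract and condense image features.
    Conv2D(32, (3, 3), activation='relu', input_shape=(300, 300, 3)),
    MaxPooling2D((2, 2)),
    Conv2D(64, (3, 3), activation='relu'),
    MaxPooling2D((2, 2)),
    Conv2D(64, (3, 3), activation='relu'),
    MaxPooling2D((2, 2)),
    # The fully connected classifier: 64 nodes, then a single sigmoid output.
    Flatten(),
    Dense(64, activation='relu'),
    Dense(1, activation='sigmoid'),
])

model.compile(optimizer='adam',
              loss='binary_crossentropy',
              metrics=['accuracy'])
```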

3) Image Augmentation —

To achieve practically useful accuracy, CNNs used in production are trained on millions of images. We have neither such a huge dataset nor the computational resources required for such a training process.

Image augmentation helps us increase the scope of our very limited dataset. Already existing photos can be manipulated via techniques such as shearing, rotation, zooming, flipping etc., to generate new samples and extend our dataset without having to download more images.

Below, we will define two objects of the ImageDataGenerator class to augment our existing set of images.

The training dataset will be rescaled and further augmented with new samples through shearing, zooming and horizontal flipping, and we will also resize all images to 300 × 300 pixels.

For the testing dataset, we will only rescale and resize the photos.
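A sketch of both generators and the directory iterators; the shear and zoom factors, batch size and folder names are assumptions.

```python
from tensorflow.keras.preprocessing.image import ImageDataGenerator

# Augment the training images; only rescale the test images.
train_datagen = ImageDataGenerator(rescale=1. / 255,
                                   shear_range=0.2,
                                   zoom_range=0.2,
                                   horizontal_flip=True)
test_datagen = ImageDataGenerator(rescale=1. / 255)

training_set = train_datagen.flow_from_directory('train',
                                                 target_size=(300, 300),
                                                 batch_size=32,
                                                 class_mode='binary')
test_set = test_datagen.flow_from_directory('test',
                                            target_size=(300, 300),
                                            batch_size=32,
                                            class_mode='binary')
```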

4) Training & Evaluation —

All that’s left now is to train the model and evaluate its performance on the test photos.

The training process will take quite a bit of time; once it’s done, run the evaluate method on the model to obtain its loss and accuracy scores.
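A sketch of the final step; the epoch count is an assumption, so adjust it based on how the validation accuracy behaves.

```python
# Train on the augmented generator and validate on the held-out test set.
model.fit(training_set, epochs=25, validation_data=test_set)

loss, accuracy = model.evaluate(test_set)
print(f'Test loss: {loss:.2f}, test accuracy: {accuracy:.2%}')
```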

Here are the results of our model which achieved an accuracy of 85% and loss of 2.18 on the test set.

Performance Metrics for CNN.

Keep in mind that these metrics might be a little different for you, based on the 10 images you chose for your testing dataset.