Part 1: Image Classification using Features Extracted by Transfer Learning in Keras

Source: Deep Learning on Medium

By Ahmed F. Gad, Alibaba Cloud Community Blog author

Through a series of 4 tutorials, we will explore the classification of images by training an artificial neural network (ANN) on features extracted by transferring the learning of a pre-trained deep learning (DL) model, a convolutional neural network (CNN), in Keras.

The tutorials start by exploring the machine learning (ML) pipeline to highlight that manually engineering features is challenging, especially with large amounts of data, and that automatic feature extraction using transfer learning is often preferable. Next comes an introduction to transfer learning that covers its benefits and the conditions for its use. The series then uses Keras running in a Jupyter notebook to transfer the learning of a pre-trained model (MobileNet), trained on the ImageNet dataset, to another dataset, Fruits360. Finally, features are extracted from the model produced by transfer learning, the extracted features are analyzed to remove bad ones, and an ANN is built and trained on the remaining features.

This tutorial, Part 1 of the series, explores the ML pipeline to highlight the challenges of manual feature extraction. It then introduces transfer learning: what it is and why we can use it.

The points covered in this tutorial are listed below:

  • Exploring the Machine Learning Pipeline
  • Manual Feature Engineering
  • Automating Feature Extraction using Deep Learning
  • What is Transfer Learning?
  • Why Transfer Learning?
  • Use case in which Transfer Learning is Beneficial
  • Conditions to use Transfer Learning

Let’s get started.

Exploring the Machine Learning Pipeline

To appreciate the benefits of transfer learning with a pre-trained DL model, we first discuss the ML pipeline. This helps to identify the core benefit of transfer learning. The pipeline for building a machine learning model is shown in the next figure. Note that other steps, such as feature reduction, could be added to the pipeline, but the steps below are sufficient for building a model. Let’s briefly explain each step in the pipeline, focusing on feature engineering.

Problem definition is about understanding the problem being solved in order to find the most suitable ML techniques. It starts by deciding the scope of the problem, that is, whether it is a supervised problem (classification or regression) or an unsupervised problem (clustering). After defining the scope, the next decision is which ML algorithm to use. For a supervised problem, for example, should the algorithm be linear or non-linear, parametric or non-parametric, and so on?

Defining the problem serves the next step in the ML pipeline, which is data preparation. The machine learns from examples, and each example has inputs and outputs. If the problem is a classification problem, in which each sample is assigned to one of several pre-defined categories, then the outputs are labels. If it is a regression problem, in which the outputs are continuous values, then the output is no longer a label but a number. So, by defining the problem, we are able to prepare the data in a suitable form.

Manual Feature Engineering

After the data is ready, the next step is feature engineering. It is the most critical step in building traditional machine learning models. First, what is feature engineering? It means transforming the data from its current form to another form that serves the problem being solved. How is the data transformed from one form to another? By using feature descriptors. In computer vision, there are different feature descriptors for transforming an image from one form to another. The categories of these descriptors include color, edge, texture, and keypoint descriptors.

There are different types of descriptors in each of these categories. For example, the texture descriptors include the gray-level co-occurrence matrix (GLCM) and local binary patterns (LBP). The keypoint descriptors include the scale-invariant feature transform (SIFT), speeded-up robust features (SURF), and the Harris corner detector. This leads us to another question. For a given problem, what are the best types of descriptors to use? This is the tricky part.

The decision of whether to use a feature descriptor for a given problem is made manually by data scientists. It is all about trial and error. Based on their experience with the problem at hand, the data scientist suggests a number of descriptors to use. The features are extracted from the images using the selected descriptors, and then we move to the next 2 steps in the ML pipeline, which are training the ML algorithm and testing the trained model. Note that the model is the result of training the algorithm.

The selected descriptors might not be suitable, and the test error of the trained model might be large. The data scientist then has to change the descriptors until finding the selection that reduces the error the most. For each new selection of descriptors, the ML algorithm has to be trained and tested again.

Besides the error, other factors such as computational complexity might have to be kept in mind when selecting the descriptors. It is certainly tiresome to manually select the best descriptors that meet our needs, especially for complex problems where thousands or even millions of images are to be analyzed.

Example to Select Features

Let’s apply the above discussion to selecting the best types of features for classifying the 3 images in the next figure, given that each one corresponds to a different class. In your opinion, what is the best category of features to use (color, edge, texture, or keypoints)? The colors of these 3 images are clearly different, and thus we can use a color descriptor such as the color histogram. This serves the purpose accurately. After building an accurate model, we can move to the last step in the pipeline, which is model deployment.
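
As a rough illustration of a color histogram descriptor, the sketch below computes a per-channel histogram with NumPy. The three solid-color images are synthetic stand-ins for the images in the figure, and the helper names (`color_histogram`, `solid`) and the choice of 4 bins per channel are assumptions made for this example.

```python
import numpy as np

def color_histogram(image, bins=4):
    """Concatenate normalized per-channel histograms of an RGB image (values 0-255)."""
    channels = []
    for c in range(3):
        counts, _ = np.histogram(image[..., c], bins=bins, range=(0, 256))
        channels.append(counts / image[..., c].size)
    return np.concatenate(channels)

def solid(rgb, size=8):
    """Create a size x size image filled with a single RGB color."""
    img = np.zeros((size, size, 3), dtype=np.uint8)
    img[:] = rgb
    return img

# Synthetic stand-ins for 3 images that differ mainly in color.
red, green, blue = solid((220, 30, 30)), solid((30, 200, 30)), solid((30, 30, 220))
features = [color_histogram(img) for img in (red, green, blue)]

for name, f in zip(("red", "green", "blue"), features):
    print(name, f)
```

Each image yields a different 12-value feature vector, so a simple classifier could tell the 3 classes apart from color alone.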

What if more images are added to the dataset, as given below, where each one again corresponds to a different class? Now different images have similar colors, and thus using only the color histogram might not serve the purpose. We have to look for other types of descriptors.

Assume that a descriptor X is selected and works well for capturing the differences among the images below. Other images may later be added that descriptor X is not able to differentiate. We then have to find an alternative to descriptor X that can capture the differences. The process repeats whenever more images are added.

The discussion above highlighted that manual feature engineering is tiresome. If manual feature engineering is tedious for a given problem, what is the alternative? It is deep learning or DL for short.

Automating Feature Extraction using Deep Learning

DL automates the feature engineering of traditional machine learning: the machine itself decides the best types of features to use. The next figure compares the pipeline of traditional ML to that of DL. Rather than engineering features as in the ML pipeline, the human only supervises the building of the DL architecture in the DL pipeline. After that, training automatically finds the set of features that reduces the error as much as possible. The DL algorithm commonly used for recognition in multi-dimensional data such as images is the convolutional neural network (CNN).

DL makes it much easier to find the best types of features, but there is a catch. For the CNN to adapt itself and find the best features automatically, many thousands of images are needed. For example, MobileNet is a CNN model trained on ImageNet, one of the largest image recognition datasets, which includes over 1 million samples. Thus, plenty of data is the driver behind building MobileNet. If such a massive dataset had not been available, MobileNet could not have been created. This raises an important question. If I do not have a large dataset for building a DL model from scratch, and I also want to avoid spending time trying different feature descriptors for a traditional ML model, what should I do to extract the features automatically? The answer is transfer learning.

You do not have to build a DL model from scratch to use DL. You can use the learning of a pre-trained DL model and transfer it to your own problem. The next section discusses transfer learning.

What is Transfer Learning?

Transfer learning is adaptation more than creation. A model is not created from scratch. Instead, a pre-trained model is adapted to a new problem. Given a small dataset that is not sufficient to build a DL model from scratch, transfer learning is the option for extracting the features automatically. The next figure highlights that.

Before transfer learning, a DL model is trained on a large dataset with thousands or millions of samples. The learning of that trained DL model is then transferred so that the model can work on another, smaller dataset with just hundreds or a few thousand images.
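
In Keras, one way this could look is sketched below: MobileNet is loaded without its ImageNet classification layers and frozen, so that it only maps images to feature vectors. This is a minimal sketch, not necessarily how the later parts of the series do it. The function names, the 224x224 input size, and the global average pooling are assumptions made for the example.

```python
import numpy as np
from tensorflow.keras.applications import MobileNet
from tensorflow.keras.applications.mobilenet import preprocess_input

def build_feature_extractor(weights="imagenet"):
    """Load MobileNet without its classifier and freeze it for feature extraction."""
    base = MobileNet(weights=weights, include_top=False,
                     input_shape=(224, 224, 3), pooling="avg")
    base.trainable = False  # keep the pre-trained parameters fixed
    return base

def extract_features(model, images):
    """Map images of shape (n, 224, 224, 3) with values in [0, 255] to feature vectors."""
    batch = preprocess_input(images.astype("float32"))  # scales pixels to [-1, 1]
    return model.predict(batch, verbose=0)

# Example usage (downloads the ImageNet weights on first use):
# extractor = build_feature_extractor()
# features = extract_features(extractor, my_images)  # my_images is hypothetical
```

The resulting feature vectors can then be used to train a small model on the new dataset, which needs far fewer samples than training the whole CNN from scratch.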

Many people ask me whether they can still use deep learning for a dataset with only a few samples. There is no clear-cut answer to such a question, but what I can say is that the accuracy of the model created by transfer learning generally increases as the number of samples in the new dataset increases. The new dataset does not need to be as large as the original dataset used for training the DL model, but it should include as many samples as possible. As shown in the next figure, the more samples in the new dataset, the more the pre-trained model parameters are customized to it. As a result, the model produced by transfer learning will be able to make more accurate predictions than a model created with fewer samples.

To learn more about transfer learning, the next section discusses the reasons for using it.

Why Transfer Learning?

There are a number of reasons to do transfer learning. Here is a list that summarizes some of these important reasons.

  1. Insufficient train and test data for building a model from scratch.
  2. No need for labeling data in order to enlarge the dataset.
  3. Imbalanced data distribution.
  4. Even if the training data is enough, training a DL model from scratch usually requires high processing power and takes much time.
  5. Even if the training data is enough, the test data might not be similar to the training data and might contain new cases that the training data does not cover. The model then has to be retrained with new samples to cover such cases.
  6. Building a model from scratch requires researching the problem and deep understanding of how things work.

Let’s discuss these points.

1. Insufficient Train and Test data for Building a Model from Scratch

When a predictive model is to be built, the first task an ML engineer thinks about is collecting as much data as possible, in order to build an accurate model that can handle the different cases. A parametric machine learning algorithm has a number of parameters to be learned from the data. In some tasks, there is not enough data for the algorithm to learn these parameters correctly.

Transfer learning does not ask for much data because the algorithm is not trained from scratch. Instead, a pre-trained model is used, which has already learned these parameters. Just a small amount of data is needed to adapt the trained model to the problem at hand.

2. No Need for Labeling Data in order to enlarge the Dataset

If the dataset used for training and testing the machine learning algorithm is not large enough for the model to learn reliably, some machine learning engineers enlarge it in different ways. The most preferred way is gathering more real samples and labeling them for use in training the algorithm. Manually labeling the instances is laborious, and automatic labeling might not be accurate enough.

In some types of problems, labeling might not be the problem. An instance can only be labeled once the instance itself is available. Some problems have just a limited number of instances, and no more instances can be created easily. With medical images, for example, permission must be obtained from patients before using their data for experiments, and not all patients agree to that. When there is no way to create more instances, techniques such as image data augmentation might be helpful, but they only go so far. This is because the same instance is just transformed (e.g. rotated) to generate more images.
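
To make the limitation of augmentation concrete, here is a minimal NumPy sketch that generates rotated and flipped variants of a single image. The `augment` helper and the toy image are assumptions for illustration. Real pipelines typically use richer transformations.

```python
import numpy as np

def augment(image):
    """Return simple geometric variants of one image (height x width x channels)."""
    return [
        np.rot90(image, k=1),   # rotated by 90 degrees
        np.rot90(image, k=2),   # rotated by 180 degrees
        np.rot90(image, k=3),   # rotated by 270 degrees
        np.fliplr(image),       # mirrored left to right
        np.flipud(image),       # mirrored top to bottom
    ]

# A toy 4x6 RGB image with distinct pixel values.
image = np.arange(4 * 6 * 3).reshape(4, 6, 3)
variants = augment(image)

# Every variant holds exactly the same pixel values, only rearranged.
same_pixels = all(
    np.array_equal(np.sort(v, axis=None), np.sort(image, axis=None))
    for v in variants
)
print(len(variants), "variants, same pixel values:", same_pixels)
```

One image became six, but no new information about the class was added, which is why augmentation alone cannot replace gathering real samples.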

Transfer learning tackles this issue because a model does not have to be built from scratch, and thus not much data is needed. Only a small amount of data is needed to fine-tune the pre-trained model. It is preferable to increase the data used for fine-tuning the model, but it is acceptable if that is not possible.

3. Imbalanced Data Distribution

The previous point discusses a problem in which the dataset is balanced but all classes have only a few samples. Balanced means that all classes have nearly equal shares of the data in the entire dataset and no class has noticeably more samples than another.

Some other problems might have a class with many more samples than another. In this case, the dataset is said to have an imbalanced class distribution. As a result, the machine learning model will be biased toward this class and give it more importance than the other class. The probability of classifying an input sample with this class label will be higher than for the other class. The class with the higher ratio of samples is called the majority class, whereas the other class is called the minority class. The engineer can deal with this issue in different ways.

If the minority class has few samples but enough to build the machine learning model, then rather than using all samples in the majority class, just an equal number of samples is selected from it to obtain a balanced dataset.
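
Selecting an equal number of samples from the majority class is often called random undersampling. A minimal NumPy sketch follows. The `undersample` helper and the 90/10 toy dataset are assumptions for illustration.

```python
import numpy as np

def undersample(X, y, seed=0):
    """Randomly keep as many samples of each class as the smallest class has."""
    rng = np.random.default_rng(seed)
    classes, counts = np.unique(y, return_counts=True)
    n_keep = counts.min()
    keep = np.concatenate([
        rng.choice(np.flatnonzero(y == c), size=n_keep, replace=False)
        for c in classes
    ])
    keep.sort()  # preserve the original sample order
    return X[keep], y[keep]

# Toy imbalanced dataset: 90 majority samples (class 0), 10 minority samples (class 1).
y = np.array([0] * 90 + [1] * 10)
X = np.arange(len(y)).reshape(-1, 1)

X_bal, y_bal = undersample(X, y)
print(np.bincount(y), "->", np.bincount(y_bal))
```

The balanced dataset keeps all minority samples and discards most of the majority class, so this only makes sense when the minority class alone is large enough to train on.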

If the minority class has few samples and the number is not enough to build the machine learning model, then new samples must be added to the minority class. The most preferred way, as discussed previously, is to gather more real data for the minority class. If that is not possible, synthetic techniques are available for creating new samples. One of these techniques is SMOTE (Synthetic Minority Over-sampling Technique). The problem is that such generated samples are not real. The more non-realistic samples are used, the worse the learning process may be.
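
The core idea of SMOTE is to create a new sample on the line segment between a minority sample and one of its nearest minority neighbors. The sketch below is a simplified illustration of that idea in NumPy, not a full implementation of the published algorithm. The function name `smote_like`, the value of `k`, and the toy points are assumptions.

```python
import numpy as np

def smote_like(X_min, n_new, k=3, seed=0):
    """Create n_new synthetic points by interpolating between minority neighbors."""
    rng = np.random.default_rng(seed)
    # Pairwise distances among the minority samples.
    dists = np.linalg.norm(X_min[:, None, :] - X_min[None, :, :], axis=2)
    np.fill_diagonal(dists, np.inf)               # a sample is not its own neighbor
    neighbors = np.argsort(dists, axis=1)[:, :k]  # indices of the k nearest neighbors
    synthetic = []
    for _ in range(n_new):
        i = rng.integers(len(X_min))              # pick a minority sample
        j = rng.choice(neighbors[i])              # pick one of its neighbors
        gap = rng.random()                        # position along the segment, in [0, 1)
        synthetic.append(X_min[i] + gap * (X_min[j] - X_min[i]))
    return np.array(synthetic)

# Four toy minority samples with 2 features each.
X_min = np.array([[1.0, 1.0], [1.2, 0.9], [0.8, 1.1], [1.1, 1.3]])
X_new = smote_like(X_min, n_new=6)
print(X_new)
```

Every synthetic point lies between two existing samples, so the method fills in the region the minority class already occupies rather than adding genuinely new cases.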

Transfer learning overcomes this issue for the reason mentioned previously. Just use the few real samples to fine-tune the model. Of course, more samples are preferred for this job, but if there are not many samples, then transferring the learning is preferable to learning from scratch.

4. Training a DL model from Scratch requires High Processing Power and takes much Time

The previous points favor transfer learning over building a model from scratch because of insufficient data. Assuming there is plenty of data, does that mean we should build a model from scratch rather than use transfer learning? Definitely not. Transfer learning is selected not just when there is little data but also because building a model from scratch requires machines with high processing power and large amounts of RAM. A machine with such specifications may not be available to all of us. Even if cloud computing is available, it might be costly for some people. This is why transfer learning might still be used even if there is a sufficient amount of data. The extra data helps to fine-tune the model well and adapt it to the problem being solved.

The pre-trained model is expected to be generic, and the engineer adapts it to the problem being solved. This is a matter of moving from a general case to a more specific case that serves the purpose well. Of course, fine-tuning might not need as much data as building a model from scratch, but more data still helps to adapt the model well to the problem.

5. New Test Samples are not covered in the Training Data

For most of the problems solved using machine learning, the training and test data are similar and drawn from the same distribution. A model produced from this data will not find it difficult to handle a sample similar to the ones it was trained on. The problem is that some new samples may not be similar to the training data and may follow a somewhat different distribution.

Machine learning engineers deal with this issue by building a new model that handles such new samples. Because the possible variations are countless, it is not practical to change the behavior of the previously trained model for each group of samples whose characteristics differ from those used in training. The trained model may already be in production, and it is not possible to change it each time some new samples become available.

Using transfer learning, the pre-trained model has already seen thousands or millions of samples that cover many of the cases that might exist in the test data. The possibility of seeing an unfamiliar sample in the future drops.

6. Building a Model from Scratch requires Researching the Problem and Deep Understanding of How Things Work

If a deep convolutional neural network (DCNN) is to be built by a researcher, the first step is a solid understanding of how an artificial neural network (ANN) works. Because the CNN is an extension of the ANN, the researcher has to be familiar with how a CNN works and with its different types of layers. The researcher also has to create a CNN architecture for the problem being solved by stacking different layers together. Deriving the best CNN architecture is a very challenging task that requires much time and effort.

Using transfer learning, the researcher does not have to know everything because there is no need to build an architecture from scratch. Only a small number of parameters have to be fine-tuned.

Use case in which Transfer Learning is Beneficial

Suppose there is a dataset with 2 classes, cats and dogs, and a CNN is to be created for that classification task. Time and effort may be spent creating an architecture that achieves high classification accuracy. If there is another task of classifying a dataset with 2 other classes, horses and donkeys, then repeating the same work done for the cats-dogs classification is tiresome. Transfer learning is beneficial in this case. What was learned by the CNN trained on the cats-dogs dataset can be transferred to the horses-donkeys classification task. This saves much of the time needed to start again from scratch.

Conditions to use Transfer Learning

When transfer learning is used the right way, the results can be very good. But transfer learning can also be misused. Thus, it is important to highlight the main conditions for deciding whether transfer learning of a pre-trained DL model can be used. The conditions discussed are:

  1. Data Type Consistency
  2. Similarity in Problem Domains

Let’s discuss these 2 conditions.

1. Data Type Consistency

Before transferring learning from one problem to another, the 2 problems must be consistent in the type of data being used. Data type means images, speech, text, etc.

If images are used for building a DL model, images must also be used when transferring the learning of such a model to the new problem. It is not correct to transfer what was learned using images to a new task that uses speech data. The features learned from the images are different from what should be learned from speech signals and vice versa.

2. Similarity in Problem Domains

Data type consistency is a very important condition that must hold before transferring learning. Other factors contribute to maximizing the benefit of transfer learning. It is preferable to have similarities between the domains of the 2 problems, not only in the data type but also in how similar the 2 problems are. One problem may be about the classification of cats and dogs. The learning achieved from this problem seems applicable to another problem that classifies 2 other types of animals, such as horses and donkeys.

Even if the domains are different, it is still possible to transfer such learning, for example to a problem that classifies 2 types of tumors. Image data is used in both binary classification tasks, but the domains of the problems are different. Transfer learning is still applicable, but with limited capabilities. In a CNN, the early layers learn generic features that could be applied to many types of problems. Going deeper in the CNN, the layers become specific to the task being solved. Because the cats-dogs dataset is similar in domain to the horses-donkeys dataset, many of the features learned by a model trained on either could be transferred to the other problem. In other words, the similarity extends to a deeper level in the CNN, compared to only a shallow level when the learning is transferred to a problem in a different domain.


This tutorial started by discussing the traditional machine learning pipeline and highlighted that manual feature extraction is challenging, especially for large and complex datasets. Using deep learning, feature extraction is automated. But in order to build a deep learning model from scratch, a large dataset is needed. To use deep learning for automatic feature extraction from small datasets, transfer learning is the option.

In the next tutorial, Part 2 of the series, the practical side of the project will start by downloading, preparing, and analyzing the content of the Fruits360 dataset. By the end of Part 2, NumPy arrays will be created to hold all of the dataset image data in addition to the class labels. This data will later be fed to MobileNet to extract features after transfer learning.