Transfer Learning in Image Classification: how much training data do we really need?

Experimental Case Study

The task chosen for experimenting with Transfer Learning is the classification of flower images into 102 different categories. This task was chosen mainly because of the easy availability of a flower dataset, and because the domain of the problem is generic enough to be a good fit for applying Transfer Learning with neural networks pre-trained on the well-known ImageNet dataset.

The adopted dataset is the 102 Category Flower Dataset created by M. Nilsback and A. Zisserman [3], a collection of 8189 labelled flower images belonging to 102 different classes. Each class contains between 40 and 258 instances, and all the images show significant scale, pose and light variations. The detailed list of the 102 categories, together with the respective number of instances, is available here.

Figure 1: Examples of images extracted from the 102 Category Dataset.

In order to create training datasets of different sizes and evaluate how they affect the performance of the trained networks, the original set of flower images is split into training, validation and test sets several times, each time adopting different split percentages. Specifically, three training sets are created (from now on referred to as the Large, Medium and Small training sets) using the percentages shown in the table below.

Table 1: number of examples and split percentages (relative to the complete unpartitioned flower dataset) of the datasets used to perform the experiments.

All the splits are performed using stratified sampling, in order to avoid introducing sampling bias and to ensure that the resulting training, validation and test subsets are all representative of the whole initial set of images.
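As a rough sketch of how such a stratified split can be obtained with scikit-learn (the variables image_paths and labels are illustrative, and the split proportions below are placeholders rather than the actual percentages of Table 1):

from sklearn.model_selection import train_test_split

# First split off the test set, then carve a validation set out of the remainder.
# Passing the labels to 'stratify' preserves the class proportions in every subset.
train_val_paths, test_paths, train_val_labels, test_labels = train_test_split(
    image_paths, labels, test_size=0.5, stratify=labels, random_state=42)

train_paths, val_paths, train_labels, val_labels = train_test_split(
    train_val_paths, train_val_labels, test_size=0.2,
    stratify=train_val_labels, random_state=42)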

Adopted strategies

The image classification task described above is addressed by adopting the two popular techniques that are commonly used when applying Transfer Learning with pre-trained CNNs, namely Feature Extraction and Fine-Tuning.

Feature Extraction

Feature Extraction basically consists of taking the convolutional base of a previously trained network, running the target data through it and training a new classifier on top of the output, as summarized in the figure below.

Figure 2: Feature Extraction applied to a convolutional neural network: the classifiers are swapped while the same convolutional base is kept. “Frozen” means that the weights are not updated during training.

The classifier stacked on top of the convolutional base can either be a stack of fully-connected layers or just a single Global Pooling layer, both followed by a Dense layer with a softmax activation function. There is no specific rule regarding which kind of classifier should be adopted but, as described by Lin et al. [2], using just a single Global Pooling layer generally leads to less overfitting, since this layer has no parameters to optimize.

Consequently, since the training sets used in the experiments are relatively small, the chosen classifier consists of a single Global Average Pooling layer whose output is fed directly into a softmax-activated layer that outputs the probabilities for each of the 102 flower categories.

During training, only the weights of the top classifier are updated, while the weights of the convolutional base are “frozen” and thus kept unchanged.

In this way, the shallow classifier learns how to classify the flower images into the 102 possible categories from the off-the-shelf representations previously learned by the source model on its own domain. If the source and the target domains are similar, these representations are likely to be useful to the classifier, and the transferred knowledge can thus improve its performance once it is trained.

Fine-Tuning

Fine-Tuning can be seen as a further step beyond Feature Extraction, consisting of selectively retraining some of the top layers of the convolutional base previously used for extracting features. In this way, the more abstract representations learned by the last layers of the source model are slightly adjusted to make them more relevant for the target problem.

This can be achieved by unfreezing some of the top layers of the convolutional base, keeping frozen all its other layers and jointly training the convolutional base with the same classifier previously used for Feature Extraction, as represented in the figure below.

Figure 3: Feature Extraction compared to Fine-Tuning.

It is important to point out that, according to F. Chollet, the top layers of a pre-trained convolutional base can be fine-tuned only if the classifier on top of them has already been trained. The reason is that if the classifier were not already trained, its weights would be randomly initialized. As a consequence, the error signal propagating through the network during training would be too large, and the updates to the unfrozen weights would disrupt the abstract representations previously learned by the convolutional base.

For similar reasons, it is also recommended to perform fine-tuning with a lower learning rate than the one used for Feature Extraction.

Moreover, it is interesting to mention that the reason why only the top layers are unfrozen is that the lower layers refer to generic problem-independent features, while the top ones refer to problem-dependent features that are more linked to the specific domain for which the network has originally been trained. Consequently, the features learnt by the first layers are generally suitable for addressing a vast set of domains, while the features learnt by the top layers need to be adjusted for each specific domain.

Experiments

All the experiments have been developed and performed on the Google Colaboratory cloud platform, using Keras with TensorFlow 2.0 as backend.

ResNet50-v2 is the source network chosen for performing the experiments, and the ImageNet dataset is the source domain on which it has been pre-trained. The choice of this network is largely arbitrary: for the aim of the experiments, any other state-of-the-art network pre-trained on a large dataset could have been used as well (I simply decided to pick a ResNet-based network among the several pre-trained models provided by the Keras applications module).

The following experiments have been carried out for each of the training datasets defined at the beginning of this story (i.e. the Large, Medium and Small sets):

  1. Feature Extraction
  2. Fine-Tuning
  3. Feature Extraction with Data Augmentation
  4. Fine-Tuning with Data Augmentation

Within each experiment, the loaded images are always resized to 224 by 224 pixels. This represents the only pre-processing operation applied to the images of the datasets.
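As an illustrative sketch (the variable names and the use of the tf.data API are assumptions based on the pipeline described later, not code taken from the original experiments), the loading and resizing step could look like this:

import tensorflow as tf

IMAGES_SIZE = 224

def load_image(path, label):
    # Read, decode and resize each image to 224x224, the only pre-processing step applied
    image = tf.io.read_file(path)
    image = tf.image.decode_jpeg(image, channels=3)
    image = tf.image.resize(image, (IMAGES_SIZE, IMAGES_SIZE))
    return image, label

# Build a tf.data pipeline from lists of image paths and integer labels
train_ds = tf.data.Dataset.from_tensor_slices((train_paths, train_labels)).map(load_image)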

Data Augmentation

Data augmentation is a technique that consists of “artificially increasing the size of the training dataset by generating many realistic variants of each training instance” [1]. In the context of the performed experiments, this is implemented through three simple image-processing operations:

  1. Random cropping with a minimum crop dimension equal to 90% of the original image dimension
  2. Random mirroring along both the vertical and horizontal axes
  3. Random brightness adjustment with a maximum brightness delta of 0.2

Since TensorFlow is used as backend, all the operations defined above are implemented using the tf.image module provided by the framework, which easily integrates with the tf.data API adopted for building the input pipelines that feed data to the developed models.
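A minimal sketch of these augmentation steps with tf.image, assuming the pipeline already yields resized (image, label) pairs (the dataset name and the crop-then-resize simplification are assumptions, not the exact original implementation):

import tensorflow as tf

IMAGES_SIZE = 224
CROP_SIZE = int(IMAGES_SIZE * 0.9)  # keep at least 90% of each image dimension

def augment(image, label):
    # Random cropping followed by a resize back to the expected input size
    image = tf.image.random_crop(image, size=(CROP_SIZE, CROP_SIZE, 3))
    image = tf.image.resize(image, (IMAGES_SIZE, IMAGES_SIZE))
    # Random mirroring along the horizontal and vertical axes
    image = tf.image.random_flip_left_right(image)
    image = tf.image.random_flip_up_down(image)
    # Random brightness adjustment with a maximum delta of 0.2
    image = tf.image.random_brightness(image, max_delta=0.2)
    return image, label

# Apply the augmentation within the tf.data input pipeline
train_ds = (train_ds
            .map(augment, num_parallel_calls=tf.data.experimental.AUTOTUNE)
            .batch(16)
            .prefetch(tf.data.experimental.AUTOTUNE))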

Feature Extraction

With Keras, this classifier stacked upon a pre-trained ResNet can be easily implemented as follows:

import numpy as np
import tensorflow as tf
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense

IMAGES_SIZE = 224  # all images are resized to 224x224 pixels

# Pre-trained convolutional base with a Global Average Pooling layer on top
conv_base = tf.keras.applications.ResNet50V2(
    include_top=False,
    weights='imagenet',
    input_shape=(IMAGES_SIZE, IMAGES_SIZE, 3),
    pooling='avg'
)
model = Sequential()
model.add(conv_base)
# One output unit per flower category ('labels' holds the labels of the whole dataset)
model.add(Dense(len(np.unique(labels)), activation='softmax'))

Note that by passing False and ‘avg’ to, respectively, the ‘include_top’ and ‘pooling’ parameters, the ResNet50V2 constructor already builds the network without its original classification head and with a Global Average Pooling layer placed on top of the convolutional base.

In order to perform Feature Extraction, it is necessary to freeze the weights of the pre-trained convolutional base so that they do not get updated during the training of the overall model. This can be done by simply setting the property “trainable” to False for each layer of the convolutional base:

for layer in conv_base.layers:
    layer.trainable = False

The complete model that is created is thus the following:

Model: "sequential"
_________________________________________________________________
Layer (type) Output Shape Param #
=================================================================
resnet50v2 (Model) (None, 2048) 23564800
_________________________________________________________________
dense (Dense) (None, 102) 208998
=================================================================
Total params: 23,773,798
Trainable params: 208,998
Non-trainable params: 23,564,800

As expected, the model has only 208,998 trainable weights, which correspond to the parameters of the final softmax-activated fully-connected layer (2048 × 102 weights plus 102 biases).

In all the performed experiments, the model defined above is trained for just 30 epochs, using a batch size of 16, the Adam optimizer with a learning rate of 1e-4, a sparse categorical cross-entropy loss and an early stopping callback with patience 10 (monitoring the validation loss).
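A sketch of this training configuration, assuming tf.data pipelines named train_ds and val_ds that yield batches of 16 (image, label) pairs, could look like the following:

from tensorflow.keras.callbacks import EarlyStopping

model.compile(
    optimizer=tf.keras.optimizers.Adam(learning_rate=1e-4),
    loss='sparse_categorical_crossentropy',
    metrics=['accuracy']
)

# Stop training if the validation loss does not improve for 10 consecutive epochs
early_stopping = EarlyStopping(monitor='val_loss', patience=10)

history = model.fit(
    train_ds,
    validation_data=val_ds,
    epochs=30,
    callbacks=[early_stopping]
)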

Fine-Tuning

Fine-Tuning can be implemented just by unfreezing the last few layers of the model previously used for feature extraction and then re-training it with a lower learning rate.

The choice of how many layers should be unfrozen depends on how similar the source and target domains and tasks are. In this case, since the flower domain is not that different from the ImageNet domain, it is reasonable to unfreeze only the last two groups of layers of the convolutional base, which in the case of ResNet50-v2 are the “conv5_block3” block and the final “post” layers. This can be done as follows:

for layer in conv_base.layers:
    if layer.name.startswith('conv5_block3') or layer.name.startswith('post'):
        layer.trainable = True

Once it is compiled, the overall model that is going to be fine-tuned looks like this:

Model: "sequential_2"
_________________________________________________________________
Layer (type) Output Shape Param #
=================================================================
resnet50v2 (Model) (None, 2048) 23564800
_________________________________________________________________
dense_2 (Dense) (None, 102) 208998
=================================================================
Total params: 23,773,798
Trainable params: 4,677,734
Non-trainable params: 19,096,064

It is possible to note that, even when unfreezing just the last two groups of layers of the convolutional base, the number of trainable parameters is considerable due to the architecture of ResNet50. This is a further reason for limiting the unfreezing to these last layers, since unfreezing more of them would result in an even higher number of trainable weights, which would likely lead to overfitting given the limited size of the datasets on which the model is fine-tuned.

In all the performed experiments, the model is trained using the same configuration previously adopted for Feature Extraction, the only differences being that this time it is trained for more epochs and with a learning rate of 1e-5 (10 times lower than the one used for Feature Extraction).
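Under the same assumptions as before (the train_ds and val_ds pipelines and the early_stopping callback), the fine-tuning phase could be sketched as follows; the exact number of epochs is not stated in the article, so the value below is only a placeholder:

FINE_TUNE_EPOCHS = 100  # placeholder: the article only says "more epochs" than the 30 used before

# Re-compile the model so that the newly unfrozen layers are taken into account,
# using a learning rate 10 times lower than the one used for Feature Extraction
model.compile(
    optimizer=tf.keras.optimizers.Adam(learning_rate=1e-5),
    loss='sparse_categorical_crossentropy',
    metrics=['accuracy']
)

history_fine_tuning = model.fit(
    train_ds,
    validation_data=val_ds,
    epochs=FINE_TUNE_EPOCHS,
    callbacks=[early_stopping]
)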

Assessment metric

The metric chosen for evaluating the performance of the trained classifiers is the F1 score, which corresponds to the harmonic mean of precision and recall and represents a simple and effective way to compare the performance of two different classifiers.

Since we are dealing with a multi-class classification problem, the average precision and recall over all the classes are used for computing the F1 score. These average values can be computed in two different ways: through microaveraging or macroaveraging.

With microaveraging, the average precision and recall values are computed considering the True Positives (TP), False Positives (FP) and False Negatives (FN) of all the classes. With macroaveraging instead, precision and recall are first evaluated for every single class and then the respective mean values are computed by averaging the results obtained for the different classes.

The figure below clarifies the difference between micro and macro averaging by showing the respective equations required for computing the average precision (average recall can be computed analogously).

Figure 4: equations for computing the average precision through, respectively, micro and macro averaging.
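In LaTeX notation, denoting by TP_c and FP_c the true and false positives of class c out of C = 102 classes, the two averaged precisions shown in the figure can be written as:

P_{micro} = \frac{\sum_{c=1}^{C} TP_c}{\sum_{c=1}^{C} \left( TP_c + FP_c \right)}
\qquad
P_{macro} = \frac{1}{C} \sum_{c=1}^{C} \frac{TP_c}{TP_c + FP_c}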

When the categories have very different generalities (i.e. very different numbers of examples), the two schemes can give quite different results: macroaveraging weights every class equally, and therefore emphasizes how well a model behaves on the categories with lower generality, while microaveraging weights every single prediction equally and is consequently dominated by the most frequent classes [4]. All the F1 scores presented in the next section are computed using microaveraging which, in a single-label multi-class setting like this one, corresponds to the overall fraction of correctly classified test images and thus gives a direct measure of how well the models perform over the whole test set.
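As a concrete illustration (y_true and y_pred are assumed to be arrays holding the true and predicted class indices for the test set), both flavours of the F1 score can be computed with scikit-learn:

from sklearn.metrics import f1_score

# y_true and y_pred are 1-D arrays of class indices in the range 0..101
micro_f1 = f1_score(y_true, y_pred, average='micro')
macro_f1 = f1_score(y_true, y_pred, average='macro')  # shown only for comparison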

Results

The figure below summarizes the results obtained with the three different datasets when applying “plain” Feature Extraction and Fine-Tuning, without using any Data Augmentation strategy.

Figure 5: F1 scores (micro-averaged) obtained on the three training datasets applying feature extraction and fine-tuning, without adopting Data Augmentation.

As expected, for all the datasets, the results obtained by fine-tuning ResNet are superior to the ones obtained with mere Feature Extraction. Indeed, by fine-tuning the pre-trained convolutional base, the weights of its last layers associated with the domain-specific features are nudged, so that the representations previously learned by the network on the ImageNet domain are adapted to the new flowers domain and consequently they are made more representative and effective, leading thus to higher F1 scores.

The most significant result is that, simply by fine-tuning ResNet50-v2 on the smallest dataset, composed of only around 800 examples, it was still possible to reach a micro-averaged F1 score of 0.79.

This means that, through Transfer Learning, the developed model learned how to classify flower images (with fair accuracy) into the 102 possible categories after seeing, on average, only 8 images per category.

The next figure shows the F1 scores obtained performing the same experiments as before, but this time using the Data Augmentation techniques previously described.

Figure 6: F1 scores (micro-averaged) obtained on the three datasets applying feature extraction and fine-tuning, adopting Data Augmentation.

From the two charts represented in Figure 5 and Figure 6, it is possible to note that Data Augmentation does not bring any benefit when performing Feature Extraction, since the F1 scores for all the datasets remain the same. This is likely related to the fact that the classifiers trained with Feature Extraction on all three datasets were trained for only 30 epochs.

Table 1: number of examples and split percentages (relative to the complete unpartitioned flower dataset) of the datasets used to perform the experiments.

Recalling the sizes of the Large, Medium and Small sets, shown once again in the table above for convenience, it emerges that the F1 scores obtained on the Small training set are more meaningful than those obtained on the other ones, since the test set associated with the Small training set is, respectively, two and ten times larger than the ones associated with the Medium and Large sets. This is a further reason to believe that the classifier trained with only about 800 images is actually performing well.

An insight on how much the lack of training data can affect the classifiers trained through Transfer Learning is given by the chart depicted in the figure below.

Figure 7: F1 score and training dataset size decrement percentages for the Medium and Small datasets compared to the Large dataset.

The blue bars of the chart represent the F1 score decrement percentage of the classifiers trained on the Medium and Small datasets compared to the one trained on the Large dataset (considering only the best F1 score obtained for each dataset), while the orange bars correspond to the training set size decrement percentage of the Medium and Small datasets compared to the Large one.

The chart highlights how the classifiers trained through Transfer Learning are particularly robust to the lack of training data:

Reducing the size of the dataset on which the classifier is trained by 50% caused an F1 score decrement of only 2%, while reducing the size of the dataset by 87% led to a worsening of the F1 score of just 14%.

Finally, it is also interesting to note how Data Augmentation brings basically no performance improvement at all for the Large and Medium datasets, while it allows a slightly higher F1 score to be reached on the Small dataset, but only when Fine-Tuning is performed. This is clearer from the chart represented in the figure below, which shows the F1 improvement percentage brought by Data Augmentation when performing Fine-Tuning on the three different training sets. At the end of the day, the maximum F1 improvement brought by Data Augmentation is only 2.53%.

Figure 8: F1 improvement percentages of Fine-Tuning with Data Augmentation compared to just “plain” Fine-Tuning.

Conclusions

Bringing up again the question asked at the beginning of the story, regarding how many training examples are really necessary for Transfer Learning to be effective, the results of the performed experiments suggest that, in this specific case study, around ten examples per class are more than enough. Indeed, even with the Small dataset of only about 800 training examples, it was still possible to train a classifier with remarkable accuracy over the 102 possible classes, whose performance is not far from that of the classifiers trained on the larger datasets.

The results, therefore, confirm the effectiveness of Transfer Learning when very little data is available, showing how the performance of a classifier trained with a Transfer Learning approach is only marginally affected by the size of the dataset on which it is trained. For this reason, Data Augmentation does not heavily impact the performance of a classifier trained through Feature Extraction or Fine-Tuning, even though it still brings a slight improvement and can thus be considered worthy of use, especially given that it is relatively fast and simple to implement.

It is however necessary to point out that the dataset adopted for the experiments plays a decisive role in the excellent performance of the classifiers trained through Transfer Learning, since the flower domain of the selected dataset does not differ too much from the domain of the ImageNet dataset on which the convolutional base has been pre-trained (even though just a few classes of the 102 Category Flower Dataset are included in the ImageNet classes set). If the experiments were carried out using a dataset belonging to a specific domain completely different from the ImageNet one (for instance, a collection of labelled X-ray images), then the representations learnt by the pre-trained network would probably not have been useful to the classifier trained on the target dataset, leading to worse performance regardless of the size of the training set.

Summing up, assuming that the target domain is similar to the domain on which the adopted convolutional base has been pre-trained, Feature Extraction and Fine-Tuning make it possible to achieve high performance even on an extremely limited dataset, making Transfer Learning preferable to training from scratch when little training data is available.