How to Solve the Common Problems in Image Recognition


Most image-recognition classification tasks are plagued by the same well-known problems. Frequently there won’t be enough data to train a classification system properly, some classes may be underrepresented and, most commonly, working with unscrutinised data means working with poorly labelled data.

Data is the key that determines whether your efforts will fail or succeed. These systems don’t just need more data than humans to learn to distinguish different classes; they can need thousands of times more.

Deep learning relies on enormous amounts of high-quality data to predict future trends and behaviour patterns. The data sets need to be representative of the classes that we intend to predict; otherwise the system will learn the skewed class distribution and the resulting bias will ruin your model.

These problems usually share a common cause: the difficulty of finding, extracting and storing large quantities of data and, on a second level, of cleansing, curating and processing that data.

While we can increase computing power and data storage capacity, a single machine will not stand a chance when running a large, complex convolutional neural network against a large data set. It might not have enough space and, most likely, will not have enough computing power to run the classification system. You will also need access to parallel or distributed computing through cloud resources, and you will need to understand how to set up, organise and run complex clusters.

Yet having enough data, and the power to process it, is not enough to prevent these problems.

In this post, we’ll explore techniques that address the problems of working with small data sets, mitigate class imbalance and prevent overfitting.

Transfer Learning

Data might be the new coal, to quote Andrew Ng, and we know that deep learning algorithms need large sets of labelled data to train a fully-fledged network from scratch, but we often fail to appreciate how much data that means. Just finding enough data to meet your needs can be an endless source of frustration, but techniques such as data augmentation and transfer learning can save you a lot of the time and energy you would otherwise spend looking for data for your model.

Transfer learning is a popular and very powerful approach which, in short, means starting from a model that was pre-trained on a larger data set and adapting it to suit your own goals. The method involves cutting off the last few layers of the pre-trained model, replacing them with new ones and training those new layers on your small data set. It has the following advantages:

  • It builds the new model on top of an existing one with proven performance on image classification tasks. For example, a model can be built upon a CNN architecture such as Inception-v3 (a CNN developed by Google) pre-trained on ImageNet;
  • It reduces training time, because reusing the pre-trained parameters gives you a level of performance that could otherwise take weeks of training to reach.
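
The freeze-and-retrain pattern can be sketched in a few lines of plain Python. This is only a toy illustration under stated assumptions: `FROZEN_BASE` is a hypothetical stand-in for a real pre-trained network such as Inception-v3, the data set is made up, and the new "head" is a simple logistic regression. The point it shows is that training only ever updates the head, never the base.

```python
import math

# Hypothetical stand-in for a pre-trained network: fixed weights that turn a
# 2-value input into 2 features. A real base would be e.g. Inception-v3.
FROZEN_BASE = [[1.0, -1.0], [0.5, 0.5]]


def extract_features(x):
    """Run the frozen base (linear layer + ReLU); its weights never change."""
    return [max(0.0, sum(w * v for w, v in zip(row, x))) for row in FROZEN_BASE]


def predict(head, x):
    """New classification head: logistic regression on the frozen features."""
    feats = extract_features(x)
    z = head["bias"] + sum(w * f for w, f in zip(head["weights"], feats))
    return 1.0 / (1.0 + math.exp(-z))


def train_head(samples, labels, epochs=200, lr=0.5):
    """Fit only the head, using gradient descent on the log loss."""
    head = {"weights": [0.0, 0.0], "bias": 0.0}
    for _ in range(epochs):
        for x, y in zip(samples, labels):
            error = predict(head, x) - y  # gradient of the log loss w.r.t. z
            feats = extract_features(x)
            head["weights"] = [
                w - lr * error * f for w, f in zip(head["weights"], feats)
            ]
            head["bias"] -= lr * error
    return head


# Tiny made-up data set: class 1 when the first value exceeds the second.
samples = [[2.0, 0.0], [3.0, 1.0], [0.0, 2.0], [1.0, 3.0]]
labels = [1, 1, 0, 0]

head = train_head(samples, labels)
predictions = [round(predict(head, x)) for x in samples]
```

With a deep learning framework the structure is the same: load the pre-trained base, mark its layers as non-trainable, attach a new output layer sized for your classes and train only that layer on your own data.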

Unbalanced Data

The proportion of one group of labels in a data set is often unbalanced with respect to the others, and this minority group is frequently the set of categories that we are interested in, precisely because of its rarity. For example, suppose we have a binary classification problem in which class X represents 95% of the data and class Y the other 5%. The model becomes more sensitive to class X and less sensitive to class Y. In the extreme, a classifier that simply predicts class X every time already reaches an accuracy of 95%.

Clearly, accuracy is not an appropriate score here. In this situation, we should consider the cost of the errors, the precision and the recall. A sensible starting point is a 2-D representation of the different types of errors, in other words, a confusion matrix. In the context of our classification results, it can be described as a method of setting the actual labels against the predicted labels, as illustrated in the diagram below.

By storing, for each label, the number of true positives (TP), false positives (FP), true negatives (TN) and false negatives (FN) obtained from the model’s predictions, we can estimate the performance per label using recall and precision. Precision is defined as the ratio:

Precision = TP / (TP + FP)

Recall is defined as the ratio:

Recall = TP / (TP + FN)
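
To make this concrete, here is a minimal sketch that counts the four outcomes for the 95%/5% example and derives precision and recall for the rare class Y. The function names are our own, and returning 0.0 when a ratio has an empty denominator is a convention assumed for this sketch.

```python
def confusion_counts(actual, predicted, positive):
    """Count TP, FP, TN and FN, treating `positive` as the class of interest."""
    counts = {"tp": 0, "fp": 0, "tn": 0, "fn": 0}
    for a, p in zip(actual, predicted):
        if p == positive:
            counts["tp" if a == positive else "fp"] += 1
        else:
            counts["fn" if a == positive else "tn"] += 1
    return counts


def precision(counts):
    """TP / (TP + FP); 0.0 when the class was never predicted."""
    denominator = counts["tp"] + counts["fp"]
    return counts["tp"] / denominator if denominator else 0.0


def recall(counts):
    """TP / (TP + FN); 0.0 when the class never occurs."""
    denominator = counts["tp"] + counts["fn"]
    return counts["tp"] / denominator if denominator else 0.0


# 95 samples of class X, 5 of class Y, and a model that always answers X.
actual = ["X"] * 95 + ["Y"] * 5
predicted = ["X"] * 100

accuracy = sum(a == p for a, p in zip(actual, predicted)) / len(actual)
counts_y = confusion_counts(actual, predicted, positive="Y")

print(accuracy)             # 0.95
print(counts_y)             # {'tp': 0, 'fp': 0, 'tn': 95, 'fn': 5}
print(recall(counts_y))     # 0.0
print(precision(counts_y))  # 0.0
```

The accuracy looks healthy, yet the recall for class Y is zero: the model never finds a single sample of the class we care about.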

Recall and/or precision will disclose an underlying problem, but not solve it. However, there are different methods to mitigate the problems associated with a marked imbalance in the distribution of classes:

  • By assigning a distinct weight to each label, so that errors on the minority class cost more during training;
  • By resampling the original data set, either by oversampling the minority class, by undersampling the majority class, or both. That said, oversampling by simple duplication can be prone to overfitting, as the decision boundaries become stricter and a small set of repeated samples introduces bias;
  • By applying SMOTE (Synthetic Minority Oversampling Technique), which alleviates this problem: instead of replicating the data of the less frequent classes, it applies the same ideas behind data augmentation and creates new synthetic samples by interpolating between neighbouring instances of the minority class.
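
Two of these ideas can be sketched in plain Python. The weighting below is a simple inverse-frequency heuristic, and `smote_like` is a simplified illustration of the interpolation idea behind SMOTE rather than the reference algorithm. The function names, the sample points and the parameter `k` are all assumptions made for this example.

```python
import random
from collections import Counter


def class_weights(labels):
    """Inverse-frequency weights: n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    return {
        label: len(labels) / (len(counts) * count)
        for label, count in counts.items()
    }


def smote_like(minority, n_new, k=2, seed=0):
    """Create synthetic points on the segment between a minority sample
    and one of its k nearest minority neighbours."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        neighbours = sorted(
            (p for p in minority if p is not base),
            key=lambda p: sum((a - b) ** 2 for a, b in zip(base, p)),
        )[:k]
        other = rng.choice(neighbours)
        t = rng.random()  # position along the segment, 0 <= t < 1
        synthetic.append([a + t * (b - a) for a, b in zip(base, other)])
    return synthetic


labels = ["X"] * 95 + ["Y"] * 5
weights = class_weights(labels)  # X: 100 / (2 * 95), about 0.526; Y: 10.0

# Five made-up minority samples with two features each.
minority = [[1.0, 1.0], [1.2, 0.9], [0.8, 1.1], [1.1, 1.3], [0.9, 0.8]]
new_points = smote_like(minority, n_new=10)
```

Because every synthetic point lies between two real minority samples, the new data stays inside the region the minority class already occupies, instead of repeating the same points exactly.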


Overfitting

As we know, our model learns the key features of a data set through backpropagation, by minimising a cost function. Each complete pass over the training set is called an epoch, and with each epoch the weights are adjusted to reduce the cost of the errors. To test the accuracy of the model, a common rule is to split the data set into a training set and a validation set.

The training set is used to build and tune the model, which embodies a hypothesis based on the patterns underlying the training data. The validation set tests how well the model performs on unseen samples.
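
A minimal sketch of such a split, assuming an 80/20 ratio and a fixed random seed (both arbitrary choices for this example):

```python
import random


def train_validation_split(samples, validation_fraction=0.2, seed=42):
    """Shuffle a copy of the samples and hold out a fraction for validation."""
    shuffled = samples[:]
    random.Random(seed).shuffle(shuffled)
    n_validation = int(len(shuffled) * validation_fraction)
    return shuffled[n_validation:], shuffled[:n_validation]


data = list(range(100))  # stand-in for 100 labelled images
train_set, validation_set = train_validation_split(data)

print(len(train_set), len(validation_set))  # 80 20
```

Shuffling before splitting matters when the data is stored in some order, for instance grouped by class. With strongly unbalanced classes you would also want each class represented in both sets, which this simple sketch does not guarantee.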

In a real case, though, the validation error rarely changes smoothly; it tends to jump up and down from one epoch to the next.

At the end of each epoch we test the model with the validation set. At some point the model starts memorising the features of the training set, while the cost and the accuracy on the validation samples get worse. When we reach this stage, the model is overfitting.

The size and complexity of the network is a determining factor in overfitting. Complex architectures are more prone to it, but there are some strategies to prevent it:

  • Raising the number of samples in the training set; if the network is trained with more real cases it will generalise better;
  • Stopping the training when overfitting sets in (early stopping), which implies checking the cost function and the accuracy on the validation set after each epoch;
  • Applying a regularisation method, another popular choice to avoid overfitting.
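
The early-stopping check can be sketched as follows. This assumes we record one validation loss per epoch and tolerate a fixed number of epochs without improvement, called `patience` here. The name and the loss values are illustrative only.

```python
def early_stopping_epoch(validation_losses, patience=2):
    """Return the index of the epoch whose weights we would keep.

    Training stops once the validation loss has failed to improve
    for `patience` consecutive epochs.
    """
    best_epoch, best_loss = 0, float("inf")
    for epoch, loss in enumerate(validation_losses):
        if loss < best_loss:
            best_epoch, best_loss = epoch, loss
        elif epoch - best_epoch >= patience:
            break  # no improvement for `patience` epochs: stop training
    return best_epoch


# Made-up validation losses: they improve until epoch 3, then drift upwards.
losses = [0.90, 0.70, 0.55, 0.50, 0.52, 0.51, 0.58, 0.66]
best = early_stopping_epoch(losses)

print(best)  # 3
```

A patience greater than one keeps a single noisy epoch from ending training too early, which is useful given how much the validation error can fluctuate.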

L2 Regularisation

L2 regularisation is a method that can be used to reduce the complexity of a model by penalising large individual weights: a term proportional to the sum of the squared weights is added to the cost function. By setting this penalty we decrease the dependence of our model on the training data.
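
As a rough sketch, the penalty and its effect on a gradient step look like this. The regularisation strength `lam`, the learning rate and the example weights are arbitrary values chosen for illustration.

```python
def l2_penalty(weights, lam):
    """Penalty term: lam times the sum of the squared weights."""
    return lam * sum(w * w for w in weights)


def regularised_loss(data_loss, weights, lam):
    """Total cost = error on the data + L2 penalty on the weights."""
    return data_loss + l2_penalty(weights, lam)


def gradient_step(weights, gradients, lr, lam):
    """One gradient descent step; the penalty adds 2 * lam * w to each gradient."""
    return [w - lr * (g + 2 * lam * w) for w, g in zip(weights, gradients)]


weights = [3.0, -0.5, 0.1]
loss = regularised_loss(1.0, weights, lam=0.01)  # 1.0 + 0.01 * 9.26

# Even with zero data gradient, every weight shrinks slightly towards zero.
shrunk = gradient_step(weights, [0.0, 0.0, 0.0], lr=0.1, lam=0.01)
```

The largest weight contributes most of the penalty (9.0 of the 9.26 here), so it is also the one pulled back hardest at each step.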


Dropout

Dropout is another common option for regularisation. It is typically applied to the hidden units of the higher layers, so that we effectively end up with a different architecture at each training step. Basically, the system randomly selects neurons to be switched off during training. As a consequence, no neuron can rely on any other being present, and the network is forced to learn more general patterns from the data.
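
A minimal sketch of the mechanism, using the common "inverted dropout" variant in which the surviving activations are scaled up during training so that nothing needs to change at prediction time. The function name, the rate of 0.5 and the input values are assumptions for this example.

```python
import random


def dropout(activations, rate, rng, training=True):
    """Zero each activation with probability `rate` during training.

    Survivors are divided by (1 - rate) so the expected value of each
    activation stays the same. At prediction time nothing is dropped.
    """
    if not training or rate == 0.0:
        return list(activations)
    keep = 1.0 - rate
    return [a / keep if rng.random() < keep else 0.0 for a in activations]


rng = random.Random(0)
hidden = [1.0] * 10  # made-up activations of one hidden layer

train_out = dropout(hidden, rate=0.5, rng=rng)  # each value is 0.0 or 2.0
test_out = dropout(hidden, rate=0.5, rng=rng, training=False)  # unchanged
```

Each call during training draws a fresh random mask, which is what gives the network a different effective architecture from one step to the next.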


As we’ve seen, there are various methods and techniques for solving the most common classification problems in image recognition, each with its own benefits and potential drawbacks. Unbalanced data, overfitting and, quite frequently, a lack of available data are all real problems but, as we’ve explained, their effect can be mitigated with transfer learning, sampling methods and regularisation techniques.

This is an area that we continue to explore as we develop our own Imaginize image recognition technology. This new product feature has been designed to help our eCommerce customers improve the classification, tagging and findability of their products by automatically identifying and recognising colours and categories.

Source: Deep Learning on Medium