DeepLearning series: How to structure machine learning projects

In this blog, I will explain how to structure a machine learning project and some useful techniques for deep learning, such as transfer learning, multi-task learning, and end-to-end learning.

In a previous blog I mentioned how many strategies and parameters are involved in a machine learning project. In particular, when we want to optimize the results of our algorithm we have several options such as:

  • Gather more data
  • Have a more diverse training set
  • Increase the number of iterations
  • Change the optimization algorithm (e.g., Adam or RMSprop)
  • Change type of network
  • Add regularization
  • Change network architecture

This can be daunting, especially because a single step can cost a lot of time before we realize it wasn’t a good choice to begin with.

What we want is quick advice on which idea is worth pursuing. That is, which knob we want to turn to achieve a particular effect. (This is what they call “orthogonalization”, which sounds intimidating in itself!)

Let’s start in order.

When tackling a machine learning project, the first thing we want is good performance on the training set (for some applications, like image recognition, this might mean reaching human-level performance). Then we want that performance to carry over to the dev set, then to the test set, and finally to the real world.

In summary, these are the things we can tweak to improve each of those steps:

– for the training set:

  • use a bigger network
  • change the optimization algorithm (e.g., Adam)

– for the dev set:

  • use regularization
  • gather more data

– for the test set:

  • if things are not going well on the test set, we have probably over-tuned to the dev set, so we might need to revisit it (for example, by using a bigger dev set).

– for the real world:

  • if things don’t go well here, the distribution of the dev and test sets might not match the real-world data.
  • or the cost function might not be measuring the right thing

You might be wondering, “how do I know if things are going well?”

Thanks for asking! Yeah, that’s right, we need a metric to evaluate things!

You might read in the published literature that to evaluate the performance of a classifier you look at the two evaluation metrics: precision and recall.

Let’s remind ourselves what they are.

For example, take an image recognition classifier for cats.

Precision, in this case, measures what percentage of the examples recognized as cats are actually cats. Recall, on the other hand, tells us what percentage of the actual cats are correctly classified as cats.

The problem with using both of these evaluation metrics is that it is difficult to tell which classifier works better: one could have better precision and another better recall.

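Wouldn’t it be best if we had only one evaluation metric? Maybe one that combines the two? Since you asked… yes! The F1-score does that for us! It is the harmonic mean of precision (P) and recall (R), a kind of average that is pulled toward the lower of the two values, and it is calculated as:

$$F_1 = \frac{2}{\frac{1}{P} + \frac{1}{R}} = \frac{2PR}{P + R}$$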
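To make the metric concrete, here is a minimal sketch that computes precision, recall and the F1-score from raw counts. The function name and the cat-classifier counts are made up for illustration; they are not from the lectures.

```python
def precision_recall_f1(true_positives, false_positives, false_negatives):
    """Return (precision, recall, F1) for the positive class from raw counts."""
    predicted_positive = true_positives + false_positives
    actual_positive = true_positives + false_negatives
    # Guard against empty denominators (no predictions or no actual positives).
    precision = true_positives / predicted_positive if predicted_positive else 0.0
    recall = true_positives / actual_positive if actual_positive else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    # Harmonic mean of precision and recall.
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Hypothetical cat classifier: 90 cats found, 10 non-cats flagged as cats,
# 30 cats missed.
p, r, f1 = precision_recall_f1(true_positives=90, false_positives=10, false_negatives=30)
print(f"precision={p:.2f} recall={r:.2f} F1={f1:.3f}")
```

Because the harmonic mean leans toward the smaller value, the F1-score here ends up below the plain arithmetic average of precision and recall.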

Sometimes we want to consider other metrics as well (e.g., running time), and therefore one evaluation metric is not sufficient. In those situations, where we have several metrics, we can pick one of them to be the optimizing metric and treat the rest as “satisficing”. A satisficing metric only has to reach a “good enough” value.

For example, we want an algorithm to run in at most 100 ms. In that case, our “satisficing” metric (running time <= 100 ms) will filter out all the algorithms that run above that time. Among the ones that run within 100 ms, we then use the other (optimizing) metric to decide which algorithm best fits our requirements.
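The selection rule can be sketched in a few lines. This assumes accuracy is the optimizing metric; the candidate models and their numbers are invented for the example.

```python
def pick_model(candidates, max_runtime_ms=100):
    """Filter by the satisficing metric, then maximize the optimizing metric."""
    # Satisficing: keep only the models that are fast enough.
    feasible = [c for c in candidates if c["runtime_ms"] <= max_runtime_ms]
    if not feasible:
        return None  # nothing meets the "good enough" threshold
    # Optimizing: among the feasible models, take the most accurate one.
    return max(feasible, key=lambda c: c["accuracy"])


# Hypothetical candidates.
candidates = [
    {"name": "A", "accuracy": 0.90, "runtime_ms": 80},
    {"name": "B", "accuracy": 0.92, "runtime_ms": 95},
    {"name": "C", "accuracy": 0.95, "runtime_ms": 1500},
]
best = pick_model(candidates)
print(best["name"])
```

Model C has the best accuracy, but it never gets compared on accuracy because it fails the running-time threshold first.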

Defining a metric to evaluate a classifier is like placing a target. Doing well on that metric is a separate step: aiming and shooting at that target.

Bias/Variance problem

I mentioned earlier that for some applications we can judge how good an algorithm is by comparing it with human-level results.

That is true for natural perception applications, such as image recognition. For applications that involve a lot of data, especially structured data, current algorithms already perform better than humans.

So the correct benchmark should be the “Bayes optimal error”, which is the lowest error that can possibly be achieved. In some cases, human error can be close to that, but it can never be below it.

Let’s evaluate the performance of our training/dev sets related to the Bayes error and see how we can identify the bias/variance problem, through the following example:

As we can see, the difference between the Bayes error and the training error points to a high bias problem (this gap is the “avoidable bias”), while the gap between the training error and the dev set error points to a high variance issue. So, fitting the training set better lowers the avoidable bias, and when the training set performance generalizes well to the dev set, we avoid high variance.
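The comparison of the two gaps can be sketched as a small helper. The Bayes error values below are hypothetical, and breaking ties in favor of “variance” is an arbitrary choice made for this sketch.

```python
def bias_variance_gaps(bayes_error, train_error, dev_error):
    """Return (avoidable_bias, variance, focus); errors are in percent."""
    avoidable_bias = train_error - bayes_error  # how far we are from the best possible
    variance = dev_error - train_error          # how badly we generalize to the dev set
    focus = "bias" if avoidable_bias > variance else "variance"
    return avoidable_bias, variance, focus


# Same training/dev errors, two hypothetical Bayes error estimates.
print(bias_variance_gaps(bayes_error=1.0, train_error=8.0, dev_error=12.0))
print(bias_variance_gaps(bayes_error=7.5, train_error=8.0, dev_error=12.0))
```

With the same training and dev errors, the recommended focus flips depending on the Bayes error estimate, which is why choosing the right benchmark matters.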

I have mentioned in a previous blog some of the techniques to overcome the bias/variance problem, but it’s worth repeating.

To fix bias:

  • train a bigger model
  • train longer
  • use better optimization algorithms (Momentum, Adam, RMSprop)
  • different neural network architecture / hyper-parameters search

To fix variance:

  • gather more data
  • use regularization (L2, dropout, data augmentation)
  • different neural network architecture / hyper-parameters search

The above bias/variance analysis only stands if the training and dev/test sets come from the same distribution.

If they come from different distributions, and we were analyzing the same error gap as above (training error of 8.0% and dev set error of 12%), we could no longer say with certainty that the model is affected by high variance, because two things are now at play:

  • the algorithm saw data in the training set that is different than the dev set
  • the distribution of the data is dissimilar

Luckily, we have an option to fix this!

We need to create a new subset of data (called the “training-dev” set), which is a portion of the training set (therefore it has the same distribution as the training set), but is not used for training.

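In this way, we can separate the two effects and analyze the errors appropriately: the gap between the training error and the training-dev error reflects variance (same distribution, unseen data), while the gap between the training-dev error and the dev error reflects the data mismatch.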
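A minimal sketch of both steps follows: carving a training-dev set out of the training data, and reading the resulting error gaps. The function names, the 10% hold-out fraction and the error values are assumptions made for illustration.

```python
import random


def carve_training_dev(train_examples, fraction=0.1, seed=0):
    """Hold out a random slice of the training set; it is never trained on."""
    shuffled = list(train_examples)
    random.Random(seed).shuffle(shuffled)
    n_holdout = int(len(shuffled) * fraction)
    return shuffled[n_holdout:], shuffled[:n_holdout]  # (train, training-dev)


def error_breakdown(bayes_error, train_error, train_dev_error, dev_error):
    """Split the total error (in percent) into the three gaps."""
    return {
        "avoidable_bias": train_error - bayes_error,
        "variance": train_dev_error - train_error,      # same distribution, unseen data
        "data_mismatch": dev_error - train_dev_error,   # different distribution
    }


train, train_dev = carve_training_dev(range(100))

# Hypothetical errors: training 8%, training-dev 9%, dev 12%, Bayes 7%.
gaps = error_breakdown(7.0, 8.0, 9.0, 12.0)
largest = max(gaps, key=gaps.get)
print(gaps, largest)
```

In this made-up case the training-dev error stays close to the training error, so most of the 8% to 12% gap would be attributed to data mismatch rather than variance.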

I guess it’s worth mentioning how to address the data mismatch.

Unfortunately, there are no systematic ways to do so. However, here are a couple of recommendations:

  • carry out a manual error analysis to understand the difference between training and dev/test sets.
  • make the training data more similar to the dev/test sets, or collect more data similar to them. For example, in a voice recognition system, if the dev/test sets have background noise while the training data is clean audio, we can use “artificial data synthesis” to add background noise to the training audio.
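As a rough illustration of artificial data synthesis, the sketch below mixes a noise clip into a clean clip. It assumes audio is stored as lists of float samples in the range [-1, 1]; the function name and the `noise_gain` parameter are hypothetical.

```python
def add_background_noise(clean, noise, noise_gain=0.3):
    """Mix a noise clip into a clean clip, sample by sample."""
    if not noise:
        raise ValueError("noise clip must not be empty")
    mixed = []
    for i, sample in enumerate(clean):
        # Repeat the noise clip if it is shorter than the clean audio.
        noisy = sample + noise_gain * noise[i % len(noise)]
        # Clip to the valid sample range.
        mixed.append(max(-1.0, min(1.0, noisy)))
    return mixed


clean_clip = [0.0, 0.5, -0.5, 1.0]
noise_clip = [0.2, -0.2]
noisy_clip = add_background_noise(clean_clip, noise_clip, noise_gain=0.5)
print(noisy_clip)
```

One caveat worth keeping in mind: if the same short noise clip is reused across a large amount of clean audio, the network may end up overfitting to that particular noise.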


In the following section I will touch on some aspects of learning that can be carried out when working on a deep learning project:

  • Transfer learning
  • Multi-task learning
  • End-to-end deep learning


Transfer learning

In a way, transfer learning is similar to how humans gather knowledge: we learn from one task and apply what we learned to others.

To be more specific, for a neural network, we delete the last output layer and its weights, and replace them with a new layer (or even several new layers) as well as a new set of randomly initialized weights for the last portion of the network.

At this point, we can retrain the new network on the new dataset.

To be more precise, we need to differentiate between two cases: a new small dataset or a new big dataset. In the first situation, we only re-train the weights of the last layer (or last few layers) and keep the rest of the parameters fixed.

On the other hand, when applying transfer learning on a new, big dataset, we can re-train all the parameters of the network. In that case, the original training on task A is usually called “pre-training”, and updating the weights on the new dataset is called “fine-tuning”.
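The procedure can be sketched without any deep learning framework. Here a network is just a list of layer dictionaries; this representation, the function name and the `trainable` flag are assumptions made for illustration, not the API of a real library.

```python
import random


def prepare_for_transfer(layers, n_new_outputs, small_dataset, seed=0):
    """Replace the output layer and mark which layers get re-trained.

    Each layer is a dict with "name" and "weights" (one row per output unit).
    """
    rng = random.Random(seed)
    # Number of inputs feeding the old output layer.
    n_inputs = len(layers[-1]["weights"][0])
    new_head = {
        "name": "new_output",
        # Fresh, randomly initialized weights for the new task.
        "weights": [[rng.gauss(0.0, 0.01) for _ in range(n_inputs)]
                    for _ in range(n_new_outputs)],
        "trainable": True,
    }
    # Small dataset: freeze the pre-trained layers. Big dataset: re-train them too.
    kept = [dict(layer, trainable=not small_dataset) for layer in layers[:-1]]
    return kept + [new_head]


# Hypothetical pre-trained network: 2 inputs -> 3 hidden units -> 1 output.
pretrained = [
    {"name": "hidden1", "weights": [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]]},
    {"name": "output", "weights": [[0.7, 0.8, 0.9]]},
]
small = prepare_for_transfer(pretrained, n_new_outputs=2, small_dataset=True)
large = prepare_for_transfer(pretrained, n_new_outputs=2, small_dataset=False)
print([(layer["name"], layer["trainable"]) for layer in small])
print([(layer["name"], layer["trainable"]) for layer in large])
```

In both cases the earlier layers keep their pre-trained weights as the starting point; the only difference is whether training is allowed to update them.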

If we are trying to learn from task A to B, then transfer-learning makes sense if:

  • Task A and B have the same type of input (for example, both take images).
  • We have a lot more data for task A than B.
  • The low-level features from A could be helpful for learning task B.

For example, we might have a lot of data for general image recognition. We can reuse a network trained on it as the starting point for X-ray image recognition, where we don’t have enough images to train a network from scratch.


Multi-task learning

We have seen how in transfer learning we have a sequential process from A to B. In multi-task learning, instead, we start off with one neural network trying to do several things at the same time, with each task hopefully helping the others.

One example of multi-task learning is to train a neural network to recognize several objects from an image, such as recognizing pedestrians, cars, stop signs and traffic lights.

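In this case, the label for each image is a vector with 4 entries, one per object:

y = (pedestrian, car, stop sign, traffic light)

where each entry can be 1 (the object is in the image), 0 (it is not), or a question mark.

(The question mark goes in any position where the image was not labeled for that specific object.)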

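The loss (averaged over the m examples of the training set) is then summed over the outputs (4 in this example):

$$J = \frac{1}{m} \sum_{i=1}^{m} \sum_{j=1}^{4} L(\hat{y}_j^{(i)}, y_j^{(i)})$$

where L is the usual logistic loss, and entries labeled with a question mark are left out of the sum over j.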

So for each image, the output of the network tells us if the image contains a pedestrian/car/stop sign/traffic light.
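A minimal sketch of such a loss is shown below. It assumes each of the four outputs is an independent probability scored with the logistic loss, and it represents the question mark as `None`; the function name and example values are made up.

```python
import math


def multitask_loss(predictions, labels):
    """Mean over examples of the per-task logistic loss, skipping unlabeled tasks.

    predictions[i][j] is the predicted probability of object j in image i.
    labels[i][j] is 1, 0, or None (the "?" case: object j was not labeled).
    """
    eps = 1e-12  # keeps log() away from zero
    total = 0.0
    for y_hat, y in zip(predictions, labels):
        for p, t in zip(y_hat, y):
            if t is None:
                continue  # unlabeled entries do not contribute to the loss
            p = min(max(p, eps), 1 - eps)
            total += -(t * math.log(p) + (1 - t) * math.log(1 - p))
    return total / len(predictions)


# Order of outputs: pedestrian, car, stop sign, traffic light.
predictions = [[0.9, 0.2, 0.5, 0.5], [0.1, 0.8, 0.7, 0.3]]
labels = [[1, 0, None, None], [0, 1, 1, 0]]
loss = multitask_loss(predictions, labels)
print(round(loss, 4))
```

Unlike a softmax classifier, one image can have several positive labels at once, which is why each output gets its own logistic loss term.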

You might be asking yourself, couldn’t I train four separate networks?

Yes, you could. But if some of the earlier features in the neural network can be shared among these different types of objects, then training one neural network to do four things often results in better performance than training four separate networks to do four separate tasks.

Multi-task learning makes sense when:

  • We are training on a set of tasks that could benefit from having shared lower-level features.
  • The amount of data we have for each task is quite similar.
  • We can train a big enough network to do well on all the tasks.

On the other hand, if the network isn’t big enough, then having separate neural networks can work better. Finally, multi-task learning is not used very often, except for applications related to object detection. Transfer learning, instead, is more widely used.


End-to-end deep learning

Some learning systems require multiple stages of processing. End-to-end learning, instead, takes all those multiple stages and replaces them with just a single neural network.

This type of learning is one of the most recent ones, and it is sometimes identified as a “black box” since we don’t pre-process data but let the network figure things out all by itself.

To understand what I mean, let’s see some examples.

  • Speech recognition:

With a multi-stage approach the process would be:

X (audio) -> features -> phonemes -> words -> Y (transcript)

An end-to-end method instead takes the audio as input and produces the transcript directly:

X (audio) -> Y (transcript)

  • Face recognition:

Multi-stage approach:

X (image) -> detect face -> zoom and crop -> cropped picture fed to NN -> Y (detection)

End-to-end approach:

X (image) -> Y (detection)

When applying end-to-end learning, the key is to have a lot of data for the network to learn a function of the complexity needed to map X to Y.

The Pros of using this method are:

  • Let the data speak (instead of imposing human “preconceptions”). For example, we saw earlier that in speech recognition one of the steps is to create phonemes. With enough data, this step turned out to be unnecessary.
  • Less “hand-designing” of components needed.

The cons:

  • We need a large amount of data.
  • It excludes potentially useful hand-designed components, which matter most when we don’t have enough data for the network to gain those insights by itself. (Kind of a double-edged sword, huh?)

This blog is based on Andrew Ng’s lectures at

Source: Deep Learning on Medium