Understanding Epochs and Batches

The beginners’ guide to training a neural network

Epoch — the timed operation

After several years of machine learning, deep learning and Artificial Intelligence related courses, the two common terms epochs and batches seem to become a normal routine, and they need to be at the fingertips of every machine learning practitioner and researcher.

These two terms sound simple and are normally used together, yet beginners often find it hard to identify the role each of them plays when training an artificial neural network, whether in deep learning or in machine learning in general. By artificial neural network, I mean any form of network that is built and trained on some data.

This article serves to clarify this confusion and to give a proper understanding of the different ways these two concepts can be approached.

Epoch

Wikipedia defines an epoch in computing as

“an epoch is a date and time from which a computer measures system time”

but this definition does not shed much light in the field of machine learning, deep learning and Artificial Intelligence.

In a simpler, more practical description, for a given dataset meant for training the neural network,

“an epoch elapses when almost/all of the dataset has been used in the training of the network, without repeating any data value”.

I used ‘almost’ all here because at times not all of the training data is used during training; this will be explained further in the batch section of this article. As an example, if our training set contains 40,000 images, an epoch is completed when ‘almost’ all of the 40,000 images have been used in the training.

Batches

In machine learning, we can decide to train our network on single records, on a batch of records, or on all the records at once. The following are the different forms of batch sizes, or data-loading techniques, used during training and/or testing of the neural network.

  1. A batch size of one implies that one record is processed at a time during training, and the weights and biases are therefore updated based on the gradients produced by that single record. This consumes less memory; however, the computation takes longer for a large dataset.
  2. The mini-batch (also simply known as a batch) is a scenario where we decide how many records to process at a time, instead of just one. This takes more memory; however, it is computationally faster, and the weight and bias updates are performed after the whole batch has been processed. According to much deep learning research, it is advisable to use batch sizes that are powers of two, e.g. 16, 32, 64, 128, … .
  3. Another way of loading data during training is to load the whole dataset and train on all of it at once. This, however, is not a good approach for large datasets, since loading everything at the same time consumes a lot of memory. (A rough sketch of all three options follows this list.)
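
As a rough sketch of these three options in PyTorch (the toy TensorDataset below is only a stand-in for a real training set), the choice comes down to the batch_size passed to the DataLoader:

    import torch
    from torch.utils.data import DataLoader, TensorDataset

    # A toy dataset of 1,000 records stands in for a real training set
    train_data = TensorDataset(torch.randn(1000, 10), torch.randint(0, 2, (1000,)))

    # 1. Batch size of one: weights and biases updated after every single record
    loader_single = DataLoader(train_data, batch_size=1, shuffle=True)

    # 2. Mini-batch: a power-of-two number of records per update, e.g. 64
    loader_mini = DataLoader(train_data, batch_size=64, shuffle=True)

    # 3. The whole dataset at once: a single, memory-hungry batch per epoch
    loader_full = DataLoader(train_data, batch_size=len(train_data), shuffle=True)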

An article by Aerin Kim on the difference between Batch Gradient Descent and Stochastic Gradient Descent gives a clear comparison of the weight-update techniques used during the optimization of a neural network.

Batches have been found to perform well during network training when a lot of training data is in use. In every epoch, the number of batches that need to be run, N, is given by

N = ceiling(number of training samples / batch size)

An epoch therefore elapses after the N batches have been processed during the training phase.
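
For example, the Fashion MNIST training set used later in this article contains 60,000 images; with a batch size of 64, N = ceiling(60,000 / 64) = ceiling(937.5) = 938 batches per epoch, and the last of those batches holds only 32 images.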

One common mistake beginners make is to think that whenever the validation accuracy has been evaluated during training, an epoch has been completed. I have been asked about this several times, so let me use this opportunity to clarify it.

To understand these concepts better, we design a neural network in PyTorch to classify the Fashion MNIST (F_MNIST) dataset. First things first, let us import the training data from torchvision and normalize it.

Loading Fashion MNIST data and normalizing it
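
The original snippet was embedded as an image, so here is a minimal sketch of that loading step using the standard torchvision pipeline; the download path and the normalization constants (0.5, 0.5) are assumptions rather than the exact values used in the original experiment.

    import torch
    from torchvision import datasets, transforms

    # Convert the images to tensors and normalize them to roughly [-1, 1]
    # (the 0.5 mean/std values are an assumption for illustration)
    transform = transforms.Compose([transforms.ToTensor(),
                                    transforms.Normalize((0.5,), (0.5,))])

    # Download the Fashion MNIST training and validation (test) sets
    trainset = datasets.FashionMNIST('F_MNIST_data/', download=True,
                                     train=True, transform=transform)
    testset = datasets.FashionMNIST('F_MNIST_data/', download=True,
                                    train=False, transform=transform)

    # Serve the data in batches of 64, reshuffled at every epoch
    trainloader = torch.utils.data.DataLoader(trainset, batch_size=64, shuffle=True)
    testloader = torch.utils.data.DataLoader(testset, batch_size=64, shuffle=True)
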
Model Definition

We can then create the network and the training loop for the dataset. In this case, we use the Adam optimizer with a learning rate lr=0.001 and the NLLLoss() function for calculating the loss.
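
A sketch of what such a network and training loop might look like is shown below, continuing from the loading snippet above. Only the Adam optimizer, the learning rate of 0.001 and the NLLLoss() criterion come from the original description; the layer sizes and the number of epochs are assumptions.

    from torch import nn, optim

    # A simple fully connected classifier for the 28x28 Fashion MNIST images
    # (the layer sizes here are an assumption, not the original architecture)
    model = nn.Sequential(nn.Flatten(),
                          nn.Linear(784, 256), nn.ReLU(),
                          nn.Linear(256, 64), nn.ReLU(),
                          nn.Linear(64, 10), nn.LogSoftmax(dim=1))

    criterion = nn.NLLLoss()
    optimizer = optim.Adam(model.parameters(), lr=0.001)

    epochs = 5
    for e in range(epochs):
        running_loss = 0
        for images, labels in trainloader:   # one pass over all batches = one epoch
            optimizer.zero_grad()
            loss = criterion(model(images), labels)
            loss.backward()
            optimizer.step()                 # weights updated once per batch
            running_loss += loss.item()

        # Validate once per epoch to see how well the model generalizes
        correct = 0
        with torch.no_grad():
            for images, labels in testloader:
                correct += (model(images).argmax(dim=1) == labels).sum().item()

        print(f"Epoch {e+1}/{epochs} - "
              f"training loss: {running_loss/len(trainloader):.3f}, "
              f"validation accuracy: {correct/len(testloader.dataset):.3f}")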

Most articles and research papers present this form of experiment: after every epoch, the trained model is validated on the validation set to see how well it generalizes to unseen data. This generates sample output such as that shown below; take note of the epoch labeling.

Sample training output for the defined network

It should also be noted that sometimes the last batch may not contain the number of images required to form a full batch. We can verify this by checking the shape of the last batch.

Modified data looping code to count the batches
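
A sketch of such a check, reusing the trainloader defined earlier, could look like this:

    # Record how many images each batch holds during one pass over the data
    batch_sizes = [images.shape[0] for images, labels in trainloader]

    print("Number of batches in one epoch:", len(batch_sizes))
    print("Size of the last batch:", batch_sizes[-1])
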
Number of images in each batch in the first epoch. The last batch has only 32 images while the others have 64 images

We can therefore choose to use this incomplete batch for training, or discard it and thereby use the floor value of N instead of the ceiling. We do this through the drop_last parameter of the train/test loader. This parameter is set to False by default to accommodate the incomplete batch, and we can set it to True to leave out the last batch whenever (dataset size % batch_size) is not equal to 0.

Dropping the last batch if it does not meet the predefined batch_size, i.e. 64
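
In code this is a single extra argument to the DataLoader; a sketch, reusing the same trainset as before:

    # Discard the final, incomplete batch so that every batch has exactly 64 images
    trainloader = torch.utils.data.DataLoader(trainset, batch_size=64,
                                              shuffle=True, drop_last=True)

    # The number of batches is now floor(60,000 / 64) = 937 instead of 938
    print(len(trainloader))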

However, validation can also be performed within an epoch, based on the number of samples (or batches) picked from the main data for training. In this case, we can keep a count (step) of the batches and run validation tests within an epoch. In the example below, the validation test was performed after every 5 batches were used for training, as indicated in the red box.

Validation done within an epoch based on the number of batches loaded during training.
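
A sketch of that idea is shown below, built on the training loop from earlier; the value of 5 comes from the original example, while everything else (model, criterion, optimizer, loaders) is reused from the snippets above.

    validate_every = 5  # run a validation pass after every 5 training batches

    for e in range(epochs):
        for step, (images, labels) in enumerate(trainloader, start=1):
            optimizer.zero_grad()
            loss = criterion(model(images), labels)
            loss.backward()
            optimizer.step()

            if step % validate_every == 0:
                # Validation performed inside the epoch, not only at its end
                correct = 0
                with torch.no_grad():
                    for v_images, v_labels in testloader:
                        correct += (model(v_images).argmax(dim=1) == v_labels).sum().item()
                print(f"epoch {e+1}, batch {step}: "
                      f"validation accuracy {correct/len(testloader.dataset):.3f}")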

Not to forget, there is also a bit of confusion beginners face about the shuffle component of the data loader. Let me use a simple scenario to shed more light on the shuffle concept in batch-sized data loading. Shuffling is done at the time of loading the entire data: when picking out a batch, we obtain data from the already shuffled data, but the main aim during training is that we have to use all of the training set, as mentioned earlier. If we were to just shuffle batches (as most beginners misconceive), we might end up not training the network on some of the data. Shuffling is done to ensure that each training batch we use consists of different samples.

For lack of a better simple example, consider a dataset of [1, 2, 3, 4, 5, 6] and a batch size of 2. For the first shuffle, we may obtain [2, 1, 6, 4, 3, 5], so batch1 = [2, 1], batch2 = [6, 4] and batch3 = [3, 5]. During the next iteration (epoch), we shuffle the data again; say we obtain [6, 4, 1, 3, 5, 2], then batch1 = [6, 4], batch2 = [1, 3] and batch3 = [5, 2].

Shuffling option enabled in the data loaders, as indicated by the red box, i.e. shuffle=True
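
The toy example above can be reproduced in a few lines; the exact batches drawn will differ on every run, which is precisely the point of shuffling. This is a self-contained sketch with made-up data:

    import torch
    from torch.utils.data import DataLoader

    data = torch.tensor([1, 2, 3, 4, 5, 6])   # toy dataset from the example above
    loader = DataLoader(data, batch_size=2, shuffle=True)

    for epoch in range(2):
        # The loader reshuffles the whole dataset at the start of every epoch,
        # so each epoch sees different batch combinations but all six values
        print(f"epoch {epoch + 1}:", [batch.tolist() for batch in loader])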

Conclusion:

  1. The use of batches is essential in the training of neural networks with large data sets.
  2. The batch size should be reasonable: not so large that it consumes a lot of memory, and not too small either, for good performance.
  3. An epoch is complete when all the data in a given set has been fully accessed for training.
  4. Validation testing can be performed within an epoch and not only after an epoch has elapsed.
  5. The last batch can be used for training by default or can be chosen to be left out.

I hope this article has given you an in-depth understanding of epochs and batch size, the distinction between them, and how they can be used in different experimentation approaches.