Original article was published on Artificial Intelligence on Medium
To train the Siamese Network, we have to first generate the proper input (in pairs) and define the ground truth label for the model.
We define two images of the same character in the same alphabet to have a similarity of 1, and any other pair to have a similarity of 0, as shown in Figure 3. We then randomly select a pair of images to feed into the network based on the parity of the index in the dataloader iteration: if the current index is odd, we retrieve a pair of images from the same character; if it is even, we retrieve images from two different characters. This keeps the training set balanced between the two labels. Both images go through the same image transformation, because the goal is to determine how similar the two images are, and transforming them differently would distort that comparison.
The following is the code for generating the training set:
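A minimal sketch of this parity-based pair sampling, not the authors' exact code, might look like the following. It assumes the images are already grouped in a dictionary keyed by (alphabet, character); the names `sample_pair` and `images_by_class` are hypothetical, and drawing two distinct images for a positive pair is an assumption.

```python
import random

def sample_pair(index, images_by_class, rng=random):
    """Return (img_a, img_b, label) for one dataloader index.

    Odd index  -> two images of the same character (label 1.0).
    Even index -> images of two different characters (label 0.0).
    `images_by_class` maps a hypothetical (alphabet, character) key
    to the list of images drawn for that character.
    """
    classes = list(images_by_class)
    if index % 2 == 1:
        cls = rng.choice(classes)
        # Assumption: the positive pair uses two distinct drawings.
        img_a, img_b = rng.sample(images_by_class[cls], 2)
        label = 1.0
    else:
        cls_a, cls_b = rng.sample(classes, 2)  # two different characters
        img_a = rng.choice(images_by_class[cls_a])
        img_b = rng.choice(images_by_class[cls_b])
        label = 0.0
    return img_a, img_b, label
```

In a PyTorch `Dataset`, this logic would sit inside `__getitem__`, with the same transform applied to both images before they are returned.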
We created 10,000 such pairs as our training data, which we then randomly split into training and validation sets with an 80:20 ratio.
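As an illustration of the random 80:20 split, the sketch below shuffles the pair indices and cuts them into two groups. The function name `split_indices` and the fixed seed are assumptions made for reproducibility, not details from the original code.

```python
import random

def split_indices(n_items, val_fraction=0.2, seed=0):
    """Shuffle indices 0..n_items-1 and split them into (train, val)."""
    indices = list(range(n_items))
    random.Random(seed).shuffle(indices)
    n_val = round(n_items * val_fraction)
    return indices[n_val:], indices[:n_val]

train_idx, val_idx = split_indices(10000)
print(len(train_idx), len(val_idx))  # 8000 2000
```

With a PyTorch dataset, `torch.utils.data.random_split` offers the same kind of split directly on the dataset object.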
A network's one-shot learning performance can be evaluated with an n-way one-shot learning metric, where we take n images representing n categories and one main image that belongs to one of those categories. For our Siamese Network, we compute the similarity of the main image against each of the n images, and the main image is assigned to the category of the image with the highest similarity.
The test loader is structured to support this evaluation: it takes a random main image and retrieves n images representing n categories, one of which is from the same category as the main image.
The following is the code for generating the test set:
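One way to build such a test episode is sketched below. This is an assumption about the structure rather than the authors' loader: `sample_episode` and `images_by_class` are hypothetical names, and the main image is deliberately kept distinct from its matching candidate.

```python
import random

def sample_episode(images_by_class, n_way, rng=random):
    """Return (main_img, candidates, target_pos) for one n-way episode.

    `candidates` holds one image from each of n_way different classes;
    candidates[target_pos] shares its class with main_img.
    """
    classes = rng.sample(list(images_by_class), n_way)
    target_pos = rng.randrange(n_way)
    # Two distinct drawings of the target character: one is the main
    # image, the other is its matching candidate.
    main_img, match_img = rng.sample(images_by_class[classes[target_pos]], 2)
    candidates = []
    for pos, cls in enumerate(classes):
        if pos == target_pos:
            candidates.append(match_img)
        else:
            candidates.append(rng.choice(images_by_class[cls]))
    return main_img, candidates, target_pos
```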
For our final testing, we evaluated the network on 4-way one-shot learning with a test set size of 1000, and on 20-way one-shot learning with a test set size of 200.
Experiment 1. Traditional Siamese Network for one-shot learning
The core of the Siamese Network is the twin convolutional architecture shown previously. The first convolutional architecture we build is the one from Koch et al. in their paper “Siamese Neural Networks for One-shot Image Recognition”, as portrayed in Figure 4. One thing to note is that, after flattening, the absolute difference between the outputs of the two convolutional branches is fed into the fully-connected layer, rather than the features of a single image.
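To see how large the flattened feature vector of each branch is, the arithmetic below tracks the feature-map size through the convolutional stack. The kernel sizes (10, 7, 4, 4 with 2×2 max pooling after the first three convolutions and 256 final channels) are taken from the Koch et al. paper; that this article's implementation uses exactly these values is an assumption.

```python
def conv_out(size, kernel, stride=1, padding=0):
    """Output width of a square conv or pooling layer."""
    return (size + 2 * padding - kernel) // stride + 1

# (layer type, kernel size); pooling layers use stride 2, convs stride 1.
layers = [("conv", 10), ("pool", 2), ("conv", 7), ("pool", 2),
          ("conv", 4), ("pool", 2), ("conv", 4)]

size = 105  # Omniglot images are 105x105
sizes = [size]
for kind, kernel in layers:
    stride = 2 if kind == "pool" else 1
    size = conv_out(size, kernel, stride)
    sizes.append(size)

flat_features = 256 * size * size
print(sizes)          # [105, 96, 48, 42, 21, 18, 9, 6]
print(flat_features)  # 9216
```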
The network in PyTorch is built as follows:
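A minimal sketch of such a network is given here, under stated assumptions. The convolutional layer sizes follow Koch et al.; the hidden width of the fully-connected head, its ReLU activation, and the class name `SiameseNetwork` are assumptions. The model returns a raw logit so that it can be paired with a BCE-with-logits loss.

```python
import torch
import torch.nn as nn

class SiameseNetwork(nn.Module):
    def __init__(self, hidden=4096):
        super().__init__()
        # Shared convolutional branch (layer sizes from Koch et al.).
        self.conv = nn.Sequential(
            nn.Conv2d(1, 64, kernel_size=10), nn.ReLU(inplace=True), nn.MaxPool2d(2),
            nn.Conv2d(64, 128, kernel_size=7), nn.ReLU(inplace=True), nn.MaxPool2d(2),
            nn.Conv2d(128, 128, kernel_size=4), nn.ReLU(inplace=True), nn.MaxPool2d(2),
            nn.Conv2d(128, 256, kernel_size=4), nn.ReLU(inplace=True),
        )
        # Head applied to the absolute difference of the two branches.
        # Assumption: one hidden layer with ReLU, then a single logit.
        self.fc = nn.Sequential(
            nn.Linear(256 * 6 * 6, hidden), nn.ReLU(inplace=True),
            nn.Linear(hidden, 1),
        )

    def forward_once(self, x):
        # Run one image through the shared branch and flatten it.
        return torch.flatten(self.conv(x), start_dim=1)

    def forward(self, img_a, img_b):
        diff = torch.abs(self.forward_once(img_a) - self.forward_once(img_b))
        return self.fc(diff)  # raw logit; apply sigmoid for a similarity
```

Because both images pass through the same `self.conv`, the two branches share their weights by construction.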
Training can then be performed with the following function:
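A training loop for one epoch could be sketched as follows. It assumes the loader yields `(img_a, img_b, label)` batches and that the model returns one logit per pair; the function name `train_one_epoch` is hypothetical.

```python
import torch
import torch.nn as nn

def train_one_epoch(model, loader, optimizer, device="cpu"):
    """Train for one epoch and return the mean loss per pair."""
    criterion = nn.BCEWithLogitsLoss()
    model.train()
    total_loss, n_pairs = 0.0, 0
    for img_a, img_b, label in loader:
        img_a, img_b = img_a.to(device), img_b.to(device)
        # Labels must be floats with the same shape as the logits.
        label = label.to(device).float().view(-1, 1)
        optimizer.zero_grad()
        logits = model(img_a, img_b)
        loss = criterion(logits, label)
        loss.backward()
        optimizer.step()
        total_loss += loss.item() * label.size(0)
        n_pairs += label.size(0)
    return total_loss / n_pairs
```

A matching validation pass would run the same loop under `model.eval()` and `torch.no_grad()`, without the optimizer calls.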
Batch Size: Since we are learning how similar two images are, the batch size needs to be fairly large for the model to generalise, especially for a dataset like this one with many different categories. We therefore used a batch size of 128.
Learning Rate: We tested several learning rates between 0.0005 and 0.001, and selected 0.0006, which gave the fastest decrease in loss.
Optimizer and Loss: We used the standard Adam optimizer for this network, with binary cross entropy (BCE) with logits as the loss.
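For reference, BCE with logits applies the sigmoid inside the loss, which allows a numerically stable form. The sketch below computes it for a single logit x and target y as max(x, 0) - x*y + log(1 + exp(-|x|)); the function name is hypothetical, and the formula is the standard stable rewrite rather than anything specific to this article.

```python
import math

def bce_with_logits(logit, target):
    """Stable binary cross entropy for one raw logit and a 0/1 target."""
    return max(logit, 0.0) - logit * target + math.log1p(math.exp(-abs(logit)))

def naive_bce(logit, target):
    """Direct definition via the sigmoid, for comparison only."""
    p = 1.0 / (1.0 + math.exp(-logit))
    return -(target * math.log(p) + (1.0 - target) * math.log(1.0 - p))

# A confident correct prediction costs little, a confident wrong one a lot.
print(bce_with_logits(2.0, 1.0))
print(bce_with_logits(2.0, 0.0))
```

In PyTorch, `torch.nn.BCEWithLogitsLoss` averages this quantity over the batch by default.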
The network is trained for 30 epochs. Figure 3 plots the training and validation loss after every epoch, which shows a dramatic decrease followed by convergence towards the end. The validation loss generally decreases along with the training loss, indicating that no overfitting occurred during training. During training, the model with the lowest validation loss is saved. We used the validation loss rather than the training loss because a low training loss alone may only mean that the model performs well on the training set, which is likely a sign of overfitting.
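The "keep the model with the lowest validation loss" rule can be sketched as a small helper. The class name `BestCheckpoint` is hypothetical, and the state is kept in memory here; in a PyTorch setting the state would typically be `model.state_dict()`, written to disk with `torch.save`.

```python
import copy

class BestCheckpoint:
    """Remember the model state with the lowest validation loss so far."""

    def __init__(self):
        self.best_loss = float("inf")
        self.best_epoch = None
        self.best_state = None

    def update(self, epoch, val_loss, state):
        """Store `state` if val_loss improves; return True when it does."""
        if val_loss < self.best_loss:
            self.best_loss = val_loss
            self.best_epoch = epoch
            # Copy so that later training steps do not overwrite it.
            self.best_state = copy.deepcopy(state)
            return True
        return False

# Toy run with made-up validation losses, purely for illustration.
tracker = BestCheckpoint()
for epoch, val_loss in enumerate([0.9, 0.5, 0.6, 0.4, 0.45]):
    tracker.update(epoch, val_loss, {"epoch": epoch})
print(tracker.best_epoch, tracker.best_loss)  # 3 0.4
```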
Experiment 2. Adding Batch Normalisation
To further improve the network, we can add batch normalisation, which is expected to make convergence faster and more stable. Figure 4 shows the updated architecture, with a BatchNorm2d after every convolutional layer.
As expected, both the training loss and the validation loss decreased much faster than with the original network. Given the better result, we also decided to train the model for more epochs to see whether it would outperform Experiment 1.
As shown in the loss graph, the results were slightly better than those of Experiment 1. Since the loss was converging slowly between epochs 40 and 50, we stopped training at the 50th epoch. This is the best result we have achieved so far.
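A sketch of the convolutional branch with batch normalisation might look like this. The helper name `conv_block` is hypothetical, the layer sizes are again those of Koch et al., and placing the BatchNorm2d between the convolution and the ReLU is an assumption about the ordering.

```python
import torch
import torch.nn as nn

def conv_block(in_ch, out_ch, kernel_size, pool=True):
    """Conv -> BatchNorm2d -> ReLU, optionally followed by 2x2 max pooling."""
    layers = [
        nn.Conv2d(in_ch, out_ch, kernel_size),
        nn.BatchNorm2d(out_ch),  # normalises each output channel
        nn.ReLU(inplace=True),
    ]
    if pool:
        layers.append(nn.MaxPool2d(2))
    return nn.Sequential(*layers)

conv_branch = nn.Sequential(
    conv_block(1, 64, 10),
    conv_block(64, 128, 7),
    conv_block(128, 128, 4),
    conv_block(128, 256, 4, pool=False),
)
```

Batch normalisation behaves differently in training and evaluation, so `model.eval()` should be called before validation and testing.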
Experiment 3. Swapping the ConvNet with a lightweight VGG16
With the original network working reasonably well, we can also try other well-established CNNs in our Siamese Network and see whether they achieve better results. Given the small image size of 105×105, we wanted a comparatively small network without too many layers that still produces decent results, and hence we borrowed the network architecture of VGG16.
The original VGG16 was still a bit too large for our input size, as its final 5 convolutional layers would be dealing with single pixels, so we removed them, ending up with the following network:
As shown in the loss graph, the training loss decreases significantly more slowly than in the prior experiments. This could be because the kernel size of the convolutional layers is fairly small (3×3), which gives a small receptive field. For the problem of computing the similarity between two images, it may be beneficial to look at the “bigger picture” of the two images instead of focusing on small details, which would explain why the larger receptive field of the original network worked better.
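The receptive-field argument can be made concrete with a short calculation. The sketch below compares only the first block of each design: a single 10×10 convolution (Koch et al.) against two stacked 3×3 convolutions (the first block of VGG16), each followed by 2×2 pooling. The function name is hypothetical, and the exact layer list of the trimmed VGG16 used in the experiment is not assumed here.

```python
def receptive_field(layers):
    """Receptive field of a stack of (kernel, stride) layers, in input pixels."""
    rf, jump = 1, 1
    for kernel, stride in layers:
        rf += (kernel - 1) * jump  # each layer widens the field
        jump *= stride             # strides spread later layers further apart
    return rf

koch_first_block = [(10, 1), (2, 2)]         # 10x10 conv, 2x2 pool
vgg_first_block = [(3, 1), (3, 1), (2, 2)]   # two 3x3 convs, 2x2 pool

print(receptive_field(koch_first_block))  # 11
print(receptive_field(vgg_first_block))   # 6
```

After one block, each unit of the 10×10 design already covers an 11×11 patch of the input, against 6×6 for the 3×3 design.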
Evaluation of the Model
The code for evaluating a network is implemented as follows:
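A minimal sketch of n-way evaluation, assuming each episode is a `(main_img, candidates, target_pos)` triple and that `similarity` is any callable returning a higher score for more similar images (for the Siamese Network, the model's output logit). All names are hypothetical, and the toy similarity on numbers only stands in for the trained model.

```python
def predict(main_img, candidates, similarity):
    """Index of the candidate most similar to the main image."""
    scores = [similarity(main_img, img) for img in candidates]
    return max(range(len(scores)), key=scores.__getitem__)

def n_way_accuracy(episodes, similarity):
    """Fraction of episodes where the best-scoring candidate is the target."""
    correct = sum(
        predict(main_img, candidates, similarity) == target_pos
        for main_img, candidates, target_pos in episodes
    )
    return correct / len(episodes)

# Toy stand-in: "images" are numbers, similarity is negative distance.
toy_similarity = lambda a, b: -abs(a - b)
toy_episodes = [
    (1.0, [5.0, 1.1, 9.0, -3.0], 1),  # predicted 1, correct
    (2.0, [2.2, 7.0, 8.0, 9.0], 0),   # predicted 0, correct
    (0.0, [3.0, 4.0, 5.0, 0.4], 2),   # predicted 3, wrong
]
accuracy = n_way_accuracy(toy_episodes, toy_similarity)
```

Random guessing gives an accuracy of 1/n, that is 25% for 4-way and 5% for 20-way, which is the floor any reported accuracy should be read against.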
4-way one shot learning
We first tested 4-way one-shot learning using a completely new set of images for evaluation: none of the test images were used during training, and none of the characters were known to the model. The results showed approximately 90% accuracy, which suggests that the model generalized well to unseen data and categories, achieving our goal of one-shot learning on the Omniglot dataset.
20-way one shot learning
Afterwards, we performed a 20-way one-shot learning evaluation on 200 sets, where the accuracy was still around 86%. We compared the results with the baselines provided by Lake et al.: