Siamese Networks for Visual Tracking

Source: Deep Learning on Medium

What are Siamese Networks?

Siamese networks are a class of artificial neural networks that consist of two or more identical sub-networks. The sub-networks share their weights and work in tandem on two different input vectors to compute output vectors that can be compared.

The following diagram illustrates how a Siamese network works.

Figure 1: Siamese Networks

Each sister network of the Siamese network is fed an image, and the network is trained using a triplet loss or a contrastive loss.
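
The weight sharing described above can be sketched in a few lines of NumPy. This is only an illustration under stated assumptions: the two-layer `embed` function, the layer sizes, and the random inputs are hypothetical stand-ins for a real convolutional branch and real images.

```python
import numpy as np

rng = np.random.default_rng(0)

# One set of parameters, used by both branches (sizes are assumptions).
W1 = rng.normal(scale=0.1, size=(64, 32))  # input 64 -> hidden 32
W2 = rng.normal(scale=0.1, size=(32, 8))   # hidden 32 -> embedding 8

def embed(x):
    """Map a batch of flattened inputs to embedding vectors."""
    hidden = np.maximum(x @ W1, 0.0)  # ReLU non-linearity
    return hidden @ W2

def siamese_forward(x_a, x_b):
    """Pass both inputs through the same weights and return their distance."""
    e_a, e_b = embed(x_a), embed(x_b)
    return np.linalg.norm(e_a - e_b, axis=1)

x_a = rng.normal(size=(4, 64))
x_b = rng.normal(size=(4, 64))
d_same = siamese_forward(x_a, x_a)  # identical inputs give zero distance
d_diff = siamese_forward(x_a, x_b)  # one distance per pair in the batch
```

Because both branches call the same `embed`, any update to `W1` or `W2` during training affects both branches at once, which is what keeps the output vectors comparable.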

What is Contrastive Loss?

The goal here isn’t classification, so the loss functions most commonly used with convolutional neural networks, such as mean squared error or cross entropy, do not fit this scenario. To differentiate between the input images, we need a loss function that contrasts them. Hadsell et al. [1] present such a contrastive loss.

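The contrastive loss can be computed using the following equation, given here in the form used by Hadsell et al. [1]:

$$L(W, Y, X_1, X_2) = (1 - Y)\,\frac{1}{2}\,D_W^2 + Y\,\frac{1}{2}\,\{\max(0,\; m - D_W)\}^2$$

In this form, $X_1$ and $X_2$ are the two inputs, $Y = 0$ for a similar pair and $Y = 1$ for a dissimilar pair, and $m > 0$ is a margin.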

Here $D_W$ is the standard Euclidean distance between the outputs of the two sister networks that form the Siamese network. The goal is therefore to learn a distance metric that tells how far apart the inputs are. Put simply, the loss can also be viewed as a function of the distance between the outputs for a pair of inputs,

where i and j index the input vectors/images and f(x) is a distance function such as the Euclidean distance.
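
As a rough sketch of the loss discussed above, the function below follows the form given by Hadsell et al. [1]: similar pairs (label 0) are penalized by their squared distance, and dissimilar pairs (label 1) are penalized only when they are closer than a margin. The function name, the default margin of 1.0, and the averaging over the batch are assumptions made for illustration.

```python
import numpy as np

def contrastive_loss(emb_a, emb_b, y, margin=1.0):
    """Contrastive loss over a batch of embedding pairs.

    y is 0 for a similar pair and 1 for a dissimilar pair.
    Averaging over the batch is an assumption of this sketch.
    """
    d = np.linalg.norm(emb_a - emb_b, axis=1)  # Euclidean distance per pair
    similar_term = (1.0 - y) * 0.5 * d ** 2
    dissimilar_term = y * 0.5 * np.maximum(0.0, margin - d) ** 2
    return float(np.mean(similar_term + dissimilar_term))

emb_a = np.array([[0.0, 0.0], [0.0, 0.0]])
emb_b = np.array([[0.6, 0.8], [0.3, 0.4]])  # distances 1.0 and 0.5
y = np.array([0.0, 1.0])                    # first pair similar, second not
loss = contrastive_loss(emb_a, emb_b, y)
```

The similar pair contributes 0.5 · 1.0² = 0.5 and the dissimilar pair contributes 0.5 · (1.0 − 0.5)² = 0.125, so the batch mean is 0.3125.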

Real-Time Visual Tracking

Zhang et al. [2], in their paper accepted at CVPR 2019, demonstrate the use of Siamese networks for real-time visual tracking. Siamese networks had been explored for this task before because they offer a balanced trade-off between accuracy and speed. However, these earlier trackers usually relied on shallow backbones such as AlexNet [3]. In this article, I demonstrate the script for a fully convolutional Siamese network with a ResNet [4] backbone.
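
To give a feel for how a fully convolutional Siamese tracker locates a target, the sketch below shows the cross-correlation step commonly used by such trackers: features of the exemplar (target) image are slid over features of a larger search image, and the peak of the response map marks the most likely target position. It is not the ResNet script referred to above; the feature maps are random arrays standing in for backbone outputs, and the function name and all shapes are assumptions.

```python
import numpy as np

def cross_correlate(search_feat, exemplar_feat):
    """Slide exemplar features over search features.

    search_feat has shape (C, Hs, Ws); exemplar_feat has shape (C, He, We).
    Returns a response map of shape (Hs - He + 1, Ws - We + 1).
    """
    _, hs, ws = search_feat.shape
    _, he, we = exemplar_feat.shape
    response = np.empty((hs - he + 1, ws - we + 1))
    for i in range(response.shape[0]):
        for j in range(response.shape[1]):
            window = search_feat[:, i:i + he, j:j + we]
            response[i, j] = np.sum(window * exemplar_feat)
    return response

rng = np.random.default_rng(0)
exemplar = rng.normal(size=(8, 4, 4))          # stand-in for target features
search = 0.1 * rng.normal(size=(8, 12, 12))    # stand-in for search features
search[:, 5:9, 3:7] = exemplar                 # plant the target at row 5, col 3

response = cross_correlate(search, exemplar)
peak = np.unravel_index(np.argmax(response), response.shape)
```

In a real tracker both feature maps would come from the same shared backbone, and the peak location would be mapped back to image coordinates to update the bounding box.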



Results as reported by [2]

The results are reported on the OTB dataset.

The fully convolutional Siamese network achieves a precision of 0.88 and an AUC of 0.67 on the OTB-2013 dataset.
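
For readers unfamiliar with these two numbers, the sketch below shows how they are commonly computed in OTB-style evaluation: precision is the fraction of frames whose predicted box center lies within a pixel threshold (20 pixels by convention) of the ground truth, and AUC is the area under the success plot, i.e. the success rate averaged over a range of overlap (IoU) thresholds. The function names, the `(x, y, w, h)` box layout, and the 21 evenly spaced thresholds are assumptions of this sketch, and the boxes are toy values rather than results from [2].

```python
import numpy as np

def center_error(pred, gt):
    """Distance between box centers; boxes are rows of (x, y, w, h)."""
    pred_c = pred[:, :2] + pred[:, 2:] / 2.0
    gt_c = gt[:, :2] + gt[:, 2:] / 2.0
    return np.linalg.norm(pred_c - gt_c, axis=1)

def iou(pred, gt):
    """Intersection over union for each pair of boxes."""
    x1 = np.maximum(pred[:, 0], gt[:, 0])
    y1 = np.maximum(pred[:, 1], gt[:, 1])
    x2 = np.minimum(pred[:, 0] + pred[:, 2], gt[:, 0] + gt[:, 2])
    y2 = np.minimum(pred[:, 1] + pred[:, 3], gt[:, 1] + gt[:, 3])
    inter = np.clip(x2 - x1, 0.0, None) * np.clip(y2 - y1, 0.0, None)
    union = pred[:, 2] * pred[:, 3] + gt[:, 2] * gt[:, 3] - inter
    return inter / union

def precision_at(pred, gt, threshold=20.0):
    """Fraction of frames with center error within the pixel threshold."""
    return float(np.mean(center_error(pred, gt) <= threshold))

def success_auc(pred, gt, n_thresholds=21):
    """Mean success rate over evenly spaced overlap thresholds in [0, 1]."""
    overlaps = iou(pred, gt)
    thresholds = np.linspace(0.0, 1.0, n_thresholds)
    return float(np.mean([np.mean(overlaps > t) for t in thresholds]))

# Two toy frames: a perfect prediction and a complete miss.
gt = np.array([[0, 0, 10, 10], [0, 0, 10, 10]], dtype=float)
pred = np.array([[0, 0, 10, 10], [50, 50, 10, 10]], dtype=float)

precision = precision_at(pred, gt)
auc = success_auc(pred, gt)
```

On these toy boxes the precision is 0.5, since one of the two frames is within 20 pixels, and the AUC is 10/21, since the perfect frame succeeds at every threshold below 1.0 and the missed frame at none.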


I would like to acknowledge the Intel Student Ambassador program, which provided me with the platform and training to explore various aspects of AI and deep learning.

For an optimized implementation of the work in this article, I made use of the Intel Distribution for Python and the Intel AI DevCloud.


  1. R. Hadsell, S. Chopra, Y. LeCun, “Dimensionality reduction by learning an invariant mapping”, Proc. CVPR, 2006.
  2. Z. Zhang, H. Peng, “Deeper and Wider Siamese Networks for Real-Time Visual Tracking”, Proc. CVPR, 2019.
  3. A. Krizhevsky, I. Sutskever, G. E. Hinton, “ImageNet classification with deep convolutional neural networks”, Advances in Neural Information Processing Systems, pages 1097–1105, 2012.
  4. K. He, X. Zhang, S. Ren, J. Sun, “Deep residual learning for image recognition”, Proc. CVPR, pages 770–778, 2016.